Thursday, 19 December 2013

FFTW Benchmarks on Cortex-A7

The FFT algorithm has many scientific uses. The most obvious is in radio astronomy, for the frequency analysis of signals, and it is vital to Software Defined Radio (SDR), which is used extensively in the Square Kilometre Array (SKA). In line with the goals of the MAC Project, I am curious about how well an ARM processor (specifically the Cortex-A7) can do FFTs - which leads to these benchmarks.

I discovered some existing benchmarks of FFTW done on the Cortex-A8 and A9 by Vesperix here. I used their modified FFTW 3.2.2 for ARM NEON and also ran benchmarks using the latest official version of FFTW: 3.3.3. Both sets of results are presented below with a short discussion afterwards.

I was unable to get the FFTW 3.3.3 NEON version working. I was repeatedly hit by a segmentation fault which I think is due to different memory alignment in the newer NEON and VFPv4 FPUs. I will post these specific benchmarks when the error is resolved.

System Specifications

The tests were run on a Cubieboard2 with the following specifications:
  • Allwinner A20 Dual-Core Cortex-A7 SoC @ ~1GHz
  • VFPv4 and NEONv2 FPU
  • 256 kB L2 Cache
  • 1 GB DDR3 RAM
  • 8GB Class 10 MicroSD Card
  • sunxi kernel 3.4.67+
  • Linaro 13.04 (with GCC 4.7.3)

Benchmark Methodology

I am only presenting the results for a complex 1D FFT with powers of two and non-powers of two. These are the types of FFTs that are most useful to radio astronomy since signal phase and amplitude are represented as a complex number. I ran several sets of benchmarks with various optimisations for comparison, each of which I will describe below.

I first tested the Vesperix FFTW 3.2.2 and then the FFTW 3.3.3. In all cases I used the following configure flags for single precision and the only available timer on the ARM processor:

--enable-single --with-slow-timer

I ran the non-SIMD (no NEON) tests without any extra flags, and the NEON SIMD tests with the flag below:

--enable-neon

I also tried out the fused multiply-add flag since the Cortex-A7 has this instruction in the VFPv4 FPU but I found that this flag actually caused performance to decrease! A short description of why this is can be found in the FFTW 3.3.3 tests section.

--enable-fma

In all cases I modified the configure script to optimise for the CPU with the '-mcpu=cortex-a7' flag. I also modified the configure script to try out different GCC FPU options where appropriate, but in general I am only presenting the fastest results in this post. The options I tried are listed below for reference:

-mfpu=neon
-mfpu=neon-vfpv4
-mfpu=vfpv4-d16
-mfpu=vfpv3-d16

I repeated the NEON tests with a threaded version of FFTW to see at which point multiple threads (on multiple cores) make a difference, and by how much. To enable the threaded version, FFTW must be recompiled with the flag below. Note that FFTW can be compiled once with this flag and then used in either a threaded or an unthreaded way.

--enable-threads
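For reference, a typical configure invocation for the threaded NEON build would therefore look something like the line below (with the '-mcpu=cortex-a7' and '-mfpu' options set by editing the configure script itself, as described above):

./configure --enable-single --with-slow-timer --enable-neon --enable-threads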

I plan on running an MPI version with more threads (4 to 16) on our Cubieboard and Wandboard clusters at a later stage.

I used the script provided by Vesperix to automate the benchmarks. For the threaded tests I modified the script to add the '-onthreads=2' flag. The number can be adjusted to suit the number of cores available on the system. The modified script is shown below.

#!/bin/sh
for TYPE in 'c'; do
  for PLACE in 'i' 'o'; do
    echo "$TYPE $PLACE 1-D powers of two (2 threads)"
    for SIZE in '2' '4' '8' '16' '32' '64' '128' '256' '512' '1024' '2048' '4096' \
                '8192' '16384' '32768' '65536' '131072' '262144' '524288' '1048576' '2097152'; do
      ./bench -onthreads=2 $OPTS ${PLACE}${TYPE}${SIZE}
    done
  done
  for PLACE in 'i' 'o'; do
    echo "$TYPE $PLACE 1-D powers of two (1 thread)"
    for SIZE in '2' '4' '8' '16' '32' '64' '128' '256' '512' '1024' '2048' '4096' \
                '8192' '16384' '32768' '65536' '131072' '262144' '524288' '1048576' '2097152'; do
      ./bench $OPTS ${PLACE}${TYPE}${SIZE}
    done
  done
done

MFLOPS Result Interpretation:

The result reported by FFTW is 'MFLOPS', but this is not a true FLOP count. It is estimated by FFTW from the algorithmic complexity assumed for the standard Cooley-Tukey FFT algorithm:
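For a complex transform of size N, the estimate used by the FFTW benchmark is the standard one documented by FFTW:

MFLOPS = 5 N log2(N) / (time for one transform in microseconds)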

Although not necessarily totally accurate in the classic FLOPS sense, it is calculated the same way in all cases, so it works as a way to compare between runs. For comparison to other algorithms, I would rather use the actual time the algorithm takes to run on a specific FFT size (N).

FFTW 3.2.2 (Vesperix):

Please examine the various graphs below. Clearly, NEON makes quite a large difference and is a 'no-brainer' for any application. I am showing one set of non-power of two benchmarks to illustrate why they should not be used.



I ran a test of the threaded version of FFTW 3.2.2 and the results are promising for a scaled-up system. The Cubieboard2 is only a dual-core system but I plan on running MPI tests with more cores at a future date.



FFTW 3.3.3 (Official):

I was unable to get the NEON version of FFTW 3.3.3 working. I was able to run benchmarks of the scalar version of the code which shows a performance improvement over the 3.2.2 scalar results. I compiled one graph comparing all the different scalar versions, with FMA instructions and without.



Note how the FMA versions have slightly lower performance. In the Benchmark Methodology section I mentioned that the --enable-fma flag actually causes performance to decrease. The reason for this is not intuitive, since one would expect a Fused Multiply-Add (FMA) instruction to save cycles by replacing separate Multiply and Add instructions. In the computation of an FFT, two of the common operations are:

t0 = a + b * c
t1 = a - b * c

The way that the NEON FMA instruction works, however, is not well suited to this pattern. This is what happens when you use the NEON FMA:

t0 = a
t0 += b * c
t1 = a
t1 -= b * c

Notice that we have to use up two move instructions for initially setting t0 and t1. It turns out that in this specific case it's faster to just use Multiplies and Adds:

t = b * c
t0 = a + t
t1 = a - t

All in all, the FMA version uses 2 Moves and 2 FMAs, while the optimal version uses 1 Multiply and 2 Adds. It's a small difference, one which the compiler may or may not optimise away, but when it is done a significant number of times it makes a difference.

Conclusion

The results from this set of benchmarks are very similar to those attained by Vesperix on Cortex-A9 boards. The multi-threaded version is also significantly better for larger FFT sizes. The results at different FFT sizes are very dependent on processor implementation details such as cache sizes and memory access times. With smaller FFTs the overhead associated with calculating the FFT is a large factor, and this is clearly visible up to sizes of 128.

The scalar results for FFTW 3.3.3 are better than those from 3.2.2 so it is logical to assume that the newer version's NEON performance will be better as well.

Since FFTW creates a 'plan' before actually calculating the FFT, the multi-threaded version chooses not to use more than one thread below a certain FFT size. This is clearly visible: it only switches to multi-threading above size 128. The overhead associated with doing so makes the result poor at size 256, and based on these results multi-threading should only be enabled for sizes over 1024.

The power usage of the Cortex-A7 processor is lower than that of the Cortex-A9, so if a large cluster of these devices is used for a computational task such as radio astronomy, one could speculate that it may be worthwhile to use more Cortex-A7s rather than fewer Cortex-A9s, since the performance is similar.


Friday, 6 December 2013

Current Measurement Board

Since we are interested in power measurements for the different ARM platforms, I decided to quickly design and build a simple current measurement board that we can connect an oscilloscope to in order to plot current (and hence calculate power, with a corresponding voltage measurement on the second channel).

The concept is based on Ohm's law: the voltage across a resistor is equal to the resistance multiplied by the current through it. The board design in the schematic caters for a known 0.01 Ohm resistor with a 1% tolerance, a gain of 100 and a 1500 Hz low-pass filter. The gain of 100 results in a voltage output that is proportional to the current: 1 A of current gives 1 V of output. This is so that we can use the oscilloscope's built-in multiplication to see power in real time. The low-pass filter is there so that we don't get too much noise on the current measurement, but still have enough response to see a spike when we start benchmarks.
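As a quick worked example of the scaling: V_out = I x 0.01 Ohm x 100, so the output voltage in volts equals the current in amps. A reading of 0.8 V on the current channel therefore means 0.8 A through the board, and with a 5 V supply measured on the second channel that is 4 W.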

I have posted images of the schematic and photos of the finished board. If you would like the Cadsoft Eagle design files, I'm happy to share them - just put a request in the comments below. Something I should also mention is that the op-amp is unnecessarily high end! To be honest, it is a free sample that I gratefully received from Maxim, so I used it...

Specifications:
Input Current: ~50 mA - 5 A
Output Voltage: ~50 mV - 5 V (dependent on how close to the negative / ground rail the op-amp can go)
Frequency Response: -3 dB @ 1591 Hz
Supply Voltage: 5 V (dependent on the op-amp)
Power Loss in 0.01 Ohm Resistor @ 5 A: 0.25 W (one can use a spreadsheet to compensate for this error)







Wednesday, 4 December 2013

Wandboard PCI-Express Adapter: Preliminary PCB 3D Images

I have been designing an adapter board to connect two Wandboards via their PCI-Express ports. The design will be finished by the end of this week, after which it will be sent in for manufacturing! This post is simply to show off two quick 3D renderings of the boards so far.

We will use this board for testing and benchmarking the potential throughput of the Freescale i.MX6 processor. The gigabit ethernet on the i.MX6 is limited to ~400 Mbps according to the datasheet, but we should be able to attain close to 5 Gbps with extremely low latency using the PCI-Express Gen 2 x1 port!

More details will follow when the board has been manufactured and tests are complete!




Saturday, 30 November 2013

Complete Wandboard Array

Following the previous posts regarding the installation of HPL on the Wandboard and Cubieboard2, and the subsequent setup of the two Cubieboards connected and running HPL, I am pleased to share that we have set up five Wandboards running Ubuntu 13.05 Server (thanks to Martin Wild) and using MPICH2 as the MPI.

If you would like details on how to set up multiple boards, please view my post on setting up the Cubieboard2 "array" here.

Getting things ready


The Wandboards arrived without power adapters. We decided to build our own using a normal 300 W PC power supply, as this provides proper grounding and better protection for the boards against static discharge.

Here is the first power connector we made. The green strip is a small two-channel PCB.
The cable is standard two-core cabling with a plug at the end which fits the Wandboard sockets. I used extra-long spacer screws so that we could stack the boards on top of each other. They had to be spaced widely enough that heat wouldn't be an issue and so that we could get fingers to each board in case we need to add hard drives.

Stacked array of the Wandboards

Now, before connecting the power, I had to write pre-made images to the SD card for each board. Once I had one board up and running I copied the SD card to the remaining four and set each IP address. There were a few issues with this which I'll speak about at the end. Once the boards were all up I connected everything together.

Completed Array


Wandboard array with power and Ethernet
Now, following similar procedures to my previous posts, I set up HPL over a shared drive using NFS and configured HPL for NEON and hardfp. I ran a quick test on the array using a small problem size to check that all boards would respond correctly. I was happy to see that all five boards showed xhpl in the process list (top) when I ran HPL.

Five terminals showing active processes when running HPL

Next up


I tried to compile ATLAS for the Cubieboard2 using neon-vfpv4, but the compile got stuck during the L1 cache tuning step due to an infinity popping up somewhere. I will recompile it using just neon and do something similar for the Wandboard. This should improve performance quite a lot, as I am using a standard ATLAS library at the moment. Once that is done I will be able to start tuning the HPL.dat file for the array.

Problems Encountered


An interesting problem came up when I copied the OS from one SD card to the others. Booting would take exceptionally long, and after finally starting up there would be no Ethernet. I checked for the adapters using ifconfig -a and they were named eth1 or eth2... not the default eth0. After some investigating it became quite obvious... When Linux boots up it detects the network devices and records them in the following file:

/etc/udev/rules.d/70-persistent-net.rules

Since the hardware (the MAC address) was different on each board, udev appended the new interface to the end of this list, so the system assigned it the next name (eth1, eth2, and so on). Simply removing the file's contents and restarting solved this issue.
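For reference, the entries in that file look roughly like the line below (with a placeholder MAC address); these are the lines to delete so that udev regenerates a clean eth0 entry on the next boot:

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="aa:bb:cc:dd:ee:ff", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"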

Another issue was the locales. I am not sure why this one popped up, but after reading through some material I simply generated the locale and reconfigured it using the following commands:

sudo locale-gen fi_FI.UTF-8
sudo dpkg-reconfigure locales

Wits Facts from Wits Weekly of November 27th


Friday, 29 November 2013

Documentation for "Evidence for Higgs Boson Decays to the τ + τ − Final State with the ATLAS Detector"

ATLAS has just released the documentation for the evidence for the decay H->tautau. The documentation can be found at:

http://cds.cern.ch/record/1632191/files/ATLAS-CONF-2013-108.pdf

The Wits group, some of its members and close collaborators have been involved in this search for a number of years.

ATLAS releases results which show evidence of the Higgs boson decaying to fermions

The ATLAS experiment just released preliminary results that show evidence, with a significance of 4.1 standard deviations, that the Higgs boson decays to two taus (which are fermions).

More information on this very exciting result can be found here:

http://www.atlas.ch/news/2013/higgs-into-fermions.html


Tuesday, 26 November 2013

Improving Higgs plus Jets analyses through Fox-Wolfram Moments

The preprint with the title "Improving Higgs plus Jets analyses through Fox-Wolfram Moments" has appeared on the arXiv today. Among other things one can see a discussion on how the jet veto can be replaced by a moment pertaining to extra jet radiation.

The abstract can be found below:

"It is well known that understanding the structure of jet radiation can significantly improve Higgs analyses. Using Fox-Wolfram moments we systematically study the geometric patterns of additional jets in weak boson fusion Higgs production with a decay to photons. First, we find a significant improvement with respect to the standard analysis based on an analysis of the tagging jet correlations. In addition, we show that replacing a jet veto by a Fox-Wolfram moment analysis of the extra jet radiation almost doubles the signal-to-background ratio. Finally, we show that this improvement can also be achieved based on a modified definition of the Fox-Wolfram moments which avoids introducing a new physical scale below the factorization scale. This modification can reduce the impact of theory uncertainties on the Higgs rate and couplings measurements."

The link to the preprint is:

http://arxiv.org/abs/1311.5891

As Tilman likes to say

Aloha


Sunday, 24 November 2013

Set up "Array" of two Cubieboard2's with MPI and HPL

Now that I have been able to get HPL working on the Cubieboard2 the next step would be to get it working on an array of boards. I was only able to get my hands on two boards so I am treating this as a proof of principle for later larger arrays.

If you do not have HPL set up on your board and would like a walk through please see my previous post: Installing HPL on Cubieboard2

Before we start, this is the setup I am using: two Cubieboard2s running Ubuntu 13.10. Each board has one dual-core CPU and 1 GB DDR3 RAM, so in total we have 4 cores and 2 GB RAM. I have called the boards cubiedev1 and cubiedev2 (host names). OK, let's get started.

MPI needs to be able to identify the nodes (the actual machines or computers) so that it can execute the programs on each of the nodes' cores. To do this we need to set up a hosts file.

Host names on Master Node

On the master node (generally the node from which you will launch the tests and store results), edit the hosts file and add the corresponding computers with their designated IPs.

nano /etc/hosts

127.0.0.1 localhost
192.168.1.1 cubiedev1
192.168.1.2 cubiedev2

Note that you must not have the master node specified as localhost, i.e. you must not have 127.0.0.1 cubiedev1... Even though this is true on this board, it would cause the other nodes to try to connect to localhost when connecting to cubiedev1.

Using NFS for Ease of Testing

NFS allows you to share a directory over the network. This is extremely useful for us since, to run a program such as HPL, exactly the same version must be installed on all of the nodes. So instead of copying the program to every node we can share the drive, do all our editing once and not have to worry about distributing the program around.

To install run:

sudo apt-get install nfs-kernel-server

Now we need to share the folder we will work in. The SD card that the Cubieboard has its OS on is only 8 GB, so I have an external HDD mounted at /mnt/cub1/. If you want to share a folder on your SD card instead, that's not a problem, but the read/write speeds are generally not that great and you are limited by the size. So I created a directory called mpiuser on /mnt/cub1/ and I will run all my tests from this folder.

So now we have the directory /mnt/cub1/mpiuser; we must add it to the exports file and then restart the NFS service.

nano /etc/exports

/mnt/cub1/mpiuser *(rw,sync)

sudo service nfs-kernel-server restart

The folder mpiuser has now been shared but we need to mount this on the other nodes and link it to the master node. We can do this manually from the terminal each time we boot with the mount command or we can edit the fstab file so it mounts at boot.

nano /etc/fstab

cubiedev1:/mnt/cub1/mpiuser    /mnt/cub1/mpiuser    nfs

sudo mount -a
repeat on each node

Creating the user for all MPI programs

Creating one user with the same name and password on each board will allow us to easily access each node over ssh. We need to create the user and set the home directory to our shared folder mpiuser. We then also need to change the ownership of the folder to this user.

sudo adduser mpiuser --home /mnt/cub1/mpiuser  
sudo chown mpiuser /mnt/cub1/mpiuser 

Make sure that the password is the same on all boards.

Configure SSH to use keys and not passwords

Change to our new user:
su - mpiuser

Create the key using
ssh-keygen -t rsa

Use the default location as this is now a shared directory and will update to all nodes.
Now we need to add this key to the authorized keys:
cd .ssh  
cat id_rsa.pub >> authorized_keys

If you can ssh into the other nodes using their host names without being prompted for a password, then you have set it up correctly. Test using:
ssh cubiedev2
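If you are still prompted for a password, it is usually a file permission problem: sshd refuses keys whose files are too open. Tightening the permissions on the shared home directory (an assumption, but the usual fix) normally sorts it out:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys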

MPI software

I have already installed MPICH2 as my MPI implementation; this was done in the previous post mentioned above. You can use OpenMPI instead - it's up to you.

We need to set up a machine file. This file is passed to the mpirun command via a flag. It is a list of hosts, each with the number of processes that you want to use on it. The machines file that I have is:

cubiedev1:2 #The :2 represents the number of cores
cubiedev2:2

To test that this works we will use a simple test program which can be found on this blog. Save the content below to a file called mpi_hello.c

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int myrank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("Hello from processor %d of %d\n", myrank, nprocs);

    MPI_Finalize();
    return 0;
}

Compile it with
mpicc mpi_hello.c -o mpi_hello

Now run it with the correct number of specified processors (1 for each core)
mpirun -np 4 -f machines ./mpi_hello

The output I get is:
Hello from processor 0 of 4
Hello from processor 1 of 4
Hello from processor 2 of 4
Hello from processor 3 of 4

Cool... Now we know that all the processors are being "seen".

Set up the HPL files

Copy the HPL files that you have been using into the mpiuser directory on the shared HDD. Make sure the owner is set correctly via chown (for example, sudo chown -R mpiuser hpl-2.1). If you are unsure of how to set up HPL please see Installing HPL on Cubieboard2

Set up the HPL.dat file so that the product P x Q = 4 (since we are running on both Cubieboards, with two cores each), and also make sure your problem size is large enough.
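For example, for the four cores across the two boards, the relevant HPL.dat lines could be:

2            Ps
2            Qs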

Now run HPL using:
mpirun -np 4 -f machines ./xhpl

Saturday, 23 November 2013

The sROD Module for the ATLAS Tile Calorimeter Phase-II Upgrade Demonstrator

A proceedings contribution to the International Topical Workshop on Electronics for Particle Physics has been reviewed and approved by the ATLAS collaboration. This document is a milestone towards the design and implementation of the Upgrade Demonstrator for the ATLAS TileCal electronics. The abstract can be found below:


"TileCal is the central hadronic calorimeter of the ATLAS experiment at the Large Hadron Collider (LHC) at CERN. The main upgrade of the LHC to increase the instantaneous luminosity is scheduled for 2022. The High Luminosity LHC, also called upgrade Phase-II, will imply a complete redesign of the read-out electronics in TileCal. In the new read-out architecture, the front-end electronics aims to transmit full digitized information to the back-end system in the counting rooms. Thus, the back-end system will provide digital calibrated information with en- hanced precision and granularity to the first level trigger to improve the trigger efficiencies. The demonstrator project is envisaged to qualify this new proposed architecture. A reduced part of the detector, 1/256 of the total, will be upgraded with the new electronics during 2014 to evaluate the proposed architecture in real conditions. The upgraded Read-Out Driver (sROD) will be the core element of the back-end electronics in Phase-II The sROD module is designed on a double mid-size AMC format and will operate under an AdvancedTCA framework. The module includes two Xilinx Series 7 FPGAs for data receiving and processing, as well as the implementation of embedded systems. Related to optical connectors, the sROD uses 4 QSFPs to receive and transmit data from the front-end electronics and 1 Avago MiniPOD to send preprocessed data to the first level trigger system. An SFP module maintains the compatibility with the existing hardware. A complete description of the sROD module for the demonstrator including the main functionalities, circuit design and the control software and firmware will be presented."

Below is the PCB layout of the sROD:





More information is available at the CDS entry:

https://cds.cern.ch/record/1628753

Wednesday, 20 November 2013

Assembling scintillator counters for the TileCal

Today the assembly of the gap scintillators with Bicron plastics started in building 175. The plastics were produced and machined by Bicron.

Below is a photo of Bicron plastics before being placed in the aluminum case:





Below is a photo of the assembly table. To the left is Charles Sandrock, head technician of the Wits School of Physics. His trip is supported by the SA-CERN consortium.





EDM Connector Altium Library

The Wandboard uses something called an EDM Connector to connect the mezzanine board to the base board. This connector is actually a standard MXM3.0 connector used by graphics cards in laptops. Some people over at http://www.edm-standard.org/ have begun work on developing a standard that uses this connector for multimedia and other general signals - which is exactly why the Wandboard implements this standard! Download the PDF from their web site and check it out!

I was unable to find any schematic libraries on the internet for this connector, including in the Altium libraries! Thus, I was forced to make one. I'm making it available for free use - but if you improve it please share your improvements with me, and others. The footprint is based on both the Foxconn and JAE datasheets which can be found at Digi-Key and Future Electronics.



Edit:
Note: I originally thought the JAE connector would not work for the EDM standard, as some of the E3 and E4 signals are not present (which would have left the ethernet unimplemented). The connector is mostly the same though; you can see the notch to the left of the picture above, which should not be there.

On closer inspection, the JAE connector should work. The missing signals at the notch to the left of the connector are: E1-10, E2-10, E3-1, E4-1. These are VCC and GND signals which are present elsewhere on the connector. We can assume that the board connected is using a ground and power plane, so this should not adversely affect operation.

Here's my link on Google Drive: MXM3.IntLib. You should be able to download it without having Google Drive yourself. Google might try to open it in Google Docs - I'm not sure why - but click on File -> Download and you will get the actual file!

Friday, 15 November 2013

Benchmarking of ARM processors with CMS software

CMS colleagues have recently reported benchmarking of ARM processors with CMS software. Interesting and promising results:

http://arxiv.org/pdf/1311.0269v1.pdf

Prototype of LED board for the front-end electronics test-bench of the ATLAS TileCal

The first certified prototype of the LED board for the front-end electronics of the new test bench of the ATLAS TileCal is now available at Wits.

Expect the next and final iteration next week and delivery to CERN in early December.




Installing HPL on Cubieboard2 + Ubuntu 13.10

I am following almost exactly the same procedure as my previous post with Ubuntu 12.04. Here we are working with Ubuntu 13.10 Server on the Cubieboard2 which can be found here: http://www.cubieforums.com/index.php/topic,891.0.html

System Specs

  • Cubieboard 2
    •  Processor         - Allwinner A20
    •  Cores               - Cortex-A7 Dual core
    •  Graphics PU      - ARM® Mali400MP2
    •  Memory           - 1GB DDR3
  • Using Ubuntu 13.10 Server
    • This version uses hardfp, which is better suited to ARM and makes use of the VFP
    • The GCC compiler in 13.10 is more up to date than in 12.04; we have 4.7

Prerequisites

HPL requires the availability of a Message Passing Interface (MPI) and either the Basic Linear Algebra Subprograms (BLAS) or the Vector Signal Image Processing Library (VSIPL). In my case I have used MPICH2 and the ATLAS package, both of which I got from the repository. Before you start asking why I have not used an ATLAS-tuned BLAS, and pointing out that my results will be poor because of it, I remind you that my main objective is to have HPL up and running first and foremost. There are too many things that can go wrong in the ATLAS-tuned BLAS approach. I will, however, get to these topics in future posts.

Get the required packages

sudo apt-get install mpich2
sudo apt-get install libatlas3-base-dev

Then get the HPL source code from http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
And extract it to a folder in your home directory. We need to produce the generic make file and then edit this according to our system.

Now to install

tar -xvf hpl-2.1.tar.gz
cd hpl-2.1/setup
sh make_generic
cp Make.UNKNOWN ../Make.cubieboard

Now you must link your MPI libraries correctly in order for the build to incorporate multi-core support. It took me a few hours of changing things around until I got it working. This is what I had to change in the end.

ARCH       = cubieboard
TOPdir     = $(HOME)/HDD/hpl-2.1
MPdir      = /usr/lib/mpich2
MPinc      = -I$(MPdir)/include
MPlib      = /usr/lib/libfmpich.a
LAdir      = /usr/lib/atlas-base/
LAlib      = $(LAdir)/libf77blas.so.3 $(LAdir)/libatlas.so.3
HPL_LIBS   = $(HPLlib) $(LAlib) $(MPlib) -lmpl -lcr
CCFLAGS    = $(HPL_DEFS) -mfpu=neon -mfloat-abi=hard -funsafe-math-optimizations -ffast-math -O3

Just make sure you use the correct TOPdir, and if your libraries are in different locations then change the above accordingly. I added the CCFLAGS as I wanted the best results (knowing that I have standard BLAS libraries). Here is my entire make file if you would like to compare: Make.cubieboard-U13.10.

Now compile HPL

make arch=cubieboard

HPL has a large number of input variables, and an even larger number of combinations of them, which can be very intimidating. I still have not wrapped my head around all of them. If you go into the HPL.dat file you will see what I mean; you can find it in the bin/cubieboard/ folder. You can find a full explanation of what the input variables do here. A very useful site I found gives you a standard HPL.dat file to start from, so let's start by going to the site and filling out the specs you need. Below is the HPL.dat file that I used.

HPLinpack benchmark input file
University of the Witwatersrand
HPL.out      output file name (if any)
8            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
8000         Ns
1            # of NBs
128           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                               Number of additional problem sizes for PTRANS
1200 10000 30000                values of N
0                               number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64        values of NB

Note that you must specify the number of cores that you want to run on. In our case the Cubieboard2 is dual core, hence we specify Ps x Qs = 1 x 2 = 2. If you wanted to run this on a single core then you would set Ps = Qs = 1. If you do not launch with the correct number of cores then you will get an error when running HPL. Note that if you run multiple process grids then you must start HPL with the maximum number of cores that is needed.
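As for the problem size, a common rule of thumb is to use about 80% of the total RAM, i.e. N ≈ sqrt(0.8 x memory in bytes / 8) for double precision. For the 1 GB of RAM on the Cubieboard2 this works out to around N = 10000, so the Ns value of 8000 above is a fairly conservative choice.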

Now to start HPL on both cores I need to run the mpi command. This is done with

mpirun -np 2 ./xhpl

The -np flag determines the number of cores. This must be the same as the product Ps x Qs. The output is then written to the file HPL.out, as set by the 'device out' line in HPL.dat.

Next Up

This was largely successful, as it proves that HPL is working on both cores. The next steps will be to custom-tune the BLAS libraries and also to optimise the OS with better-configured kernels. This will be explained in a different post by Mitch.


First Wits press release of January's workshop: Big data under the spotlight


The Wits press office has released the first piece to advertise the Big Data workshop in January:

http://www.wits.ac.za/newsroom/newsitems/201312/22094/news_item_22094.html

Check it out!


Big data under spotlight


15 November 2013
One of the biggest challenges for scientists this century will be to develop supercomputers that can process huge data output from big science projects such as the Square Kilometre Array.
To address this and other data questions, renowned world physicists and engineers will be attending the 2014 High-performance Signal and Data Processing workshop to be held at Wits University from 27 to 31 January 2014.

They include Dr Peter Jenni, one of the “founding fathers” and former spokesperson of the ATLAS experiment at the CERN Large Hadron Collider in Switzerland that discovered the Higgs boson in 2012, and Dr Bernie Fanaroff, Wits alumni and Project Director of the Square Kilometre Array.
“There are supercomputers in the world, but they are essentially doing a lot of computation and are extremely expensive. We want to process big flows of data,” says Professor Bruce Mellado from the High Energy Physics Group (HEP) in the School of Physics at Wits University. Together with his colleagues in HEP and fellow workshop organisers, Dr Oana Boeriu and Dr Trevor Vickey, Mellado and his team is developing and building a high-throughput supercomputer.

“Called the Massive Affordable Computing (MAC) project, HEP aims to use existing computer elements available on the market to build supercomputers that are cheap and energy efficient,” Mellado says.

Processing the vast quantities of data that the SKA will produce will require very high performance central supercomputers capable of 100 petaflops per second processing power. This is about 50 times more powerful than the current most powerful supercomputer and equivalent to the processing power of about one hundred million PCs. The technological challenges related to high-throughput data flows at the ATLAS detector today are common to those facing the SKA in the future.

With this workshop, themed Challenges in Astro- and Particle Physics and Radio Astronomy Instrumentation, the organisers aim to bring together key people to discuss the grand challenges facing the signal processing community in Radio Astronomy, Gamma Ray Astronomy and Particle Physics. But, the development of high-throughput computers will also have a revolutionary impact on data processing in all fields of science - including the medical sciences, palaeosciences and engineering - and the organisers hope to attract delegates from those fields of study as well.
The workshop will also have plenary sessions for in-depth presentations and knowledge sharing between delegates will be in lecture format, as well as a classroom environment for hands-on hardware training. General overviews and in-depth presentations will be given.

Students and young researchers are also welcome to deliver presentations and encouraged to submit abstracts. The registration and abstract submission are now open until 31 December 2013.
It is envisioned to publish a book of proceedings. Proceedings will be peer reviewed. The deadline for proceedings submission is 15 February 2014. The conference is co-presented by the SKA Africa, the University of Cape Town, the National Research Foundation/iThemba Labs, Stellenbosch University and CERN-SA.




Building a Cubieboard Kernel: Part 1

To date it seems that all of the pre-compiled kernels and toolchains online for the Cubieboard are using stock parameters which tend to be tuned for the Cortex-A9 or built using an older version of GCC which does not fully support the Cortex-A7 CPU!

For these reasons, and also because I would like to make a more 'Lean and Mean' kernel with less pointless drivers to waste memory, I have endeavoured to build my own. This post will describe the general process of building a kernel for the Cubieboard. I will note a few initial changes I made to the kernel config but there needs to be some testing before I can conclude whether my changes (and more to come, I'm sure) are worth it or not. I plan on making a 'Part 2' to confirm performance changes and my final kernel config.

Let's Get Started!

The first step is to ensure you have a working cross-compiler toolchain installed. If you do not, see my post here on setting up the latest Linaro toolchain. This post describes how to modify this toolchain to be more optimised for the Cortex-A7.

Besides the toolchain setup, above, please make sure you have u-boot-tools installed:

sudo apt-get install u-boot-tools

This package contains the mkimage command that is required to make the final image. You then need to get the source code. Kernel sources tend to be huge so I opted to get only the latest revision of code and no history. I think this at least halved the download size!

git clone --depth 1 https://github.com/linux-sunxi/linux-sunxi.git --branch sunxi-3.4

This was about a 400 MB download. Once it completes, there is a handy command to load an initial working config for the Cubieboard:

make ARCH=arm CROSS_COMPILE=${CC201310A7} sun7i_defconfig


Some Config Changes

If you would like to view or modify this default configuration then you can get to the normal menuconfig with:

make ARCH=arm CROSS_COMPILE=${CC201310A7} menuconfig

This will bring up the classic Linux kernel menuconfig. Here you can browse through, see some info on the various items with the help command, and change things! Be sure to save the config when you are done: there is a save option near the bottom of the main menu. Save the config as .config for it to be used by the make command.

As I mentioned earlier I chose to modify a few things in this initial run. I plan on comparing the performance between the kernels supplied by the community, a kernel that is a stock configuration but compiled for the Cortex-A7 with GCC 4.8 and also a kernel with my modifications to the config.

Initially, I chose only to turn off forced preemption, which should allow higher throughput by telling the kernel not to switch between tasks too eagerly. The default was a more desktop/real-time oriented setting, which is great for a responsive desktop but not great for processing tasks. Here's how you find the setting:

Kernel Features -> Preemption Model -> No Forced Preemption


Another issue I discovered was that, by default, the ethernet drivers are not compiled into the kernel - they are built as a module. This means that to use the module we have to manually tell Linux to load it. I don't want this behaviour, so I chose to build the ethernet drivers into the kernel instead.


Note that to get ethernet to work after you first boot, later on in the process, you will probably have to tell the system to bring up the interface and add some stuff to the config files so that this happens on boot:

ifconfig eth0 up
echo auto eth0 >> /etc/network/interfaces
echo iface eth0 inet dhcp >> /etc/network/interfaces
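After those two echo commands, the relevant part of /etc/network/interfaces should simply read:

auto eth0
iface eth0 inet dhcp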

The Build

Once you are happy with your changes you can build the kernel. Modify the -j3 to -j(number of CPUs + 1) to suit your build system for a faster build.

make ARCH=arm CROSS_COMPILE=${CC201310A7} uImage modules -j3

and then

make ARCH=arm CROSS_COMPILE=${CC201310A7} INSTALL_MOD_PATH=output modules_install

This will take a while... Once it's done you only need to copy the kernel uImage and modules onto your SD card! The commands below will do this for you. Note that I have mounted the boot partition of my SD card to /media/boot and the rootfs to /media/rootfs. If the uImage file is missing then the compile above failed at some point.

sudo cp -v arch/arm/boot/uImage /media/boot/
sudo rm -r /media/rootfs/lib/*
sudo cp -rv output/* /media/rootfs/lib/

Unmount the SD card, put it in your Cubieboard and hope for the best! ;)

Thursday, 14 November 2013

Conclusions from first Back-end Front-end integration exercise


The following is the outcome of the first Back-end Front-end integration exercise:
  • The communication between the Daughterboard (front-end) and the Super Read-out Driver (sROD) emulator (back-end) has been successfully tested.
  • Achieved upstream and downstream flow.
  • Two different platforms were used as sROD emulator: Xilinx Virtex 7 and Xilinx Kintex 7. Obtained satisfactory results with both evaluation boards.
  • An overnight test has been performed to spot potential transmission errors. No error was identified after 14 hours. 

Tests have been performed between an external PC and the readout chain using an IP bus:
  • An IP bus interface has been integrated in the design of the sROD emulator.
  • This facilitates easy access to hardware registers in the sROD emulator over ethernet. 
  • Different communication tests have been performed successfully using the IP bus interface, allowing the external PC to read and write registers in the sROD emulator.


Tests of communication between the Daughterboard and the Mainboard:
  • Commands and data words have been transmitted from the sROD emulator to the daughterboard using the IPbus interface.
  • The Daughterboard processes and transmits the commands and data to the Mainboard.
  • This test was not completely successful. The daughterboard reacted to the commands received over IPbus, but not in the expected way.
  • Bug probably in the state machine that decodes the command into specific actions for the Mainboard. 

Next integration session will take place in December at CERN.

To the right, Pablo Moreno, from Wits.




Wednesday, 13 November 2013

Installing HPL on Wandboard + Ubuntu 12.04

With the ultimate goal of benchmarking an array of Wandboards, a good starting point is to install High Performance LINPACK (HPL) on one machine and then begin expanding the number of boards. This post will discuss how to configure all the required packages and HPL itself to get it up and running. Note: this is not a discussion on tuning or optimising for high FLOP counts, but just on getting a working benchmark. I will discuss tuning in a later post.

System Specs

  • Wandboard Quad
    •  Processor         - Freescale i.MX6 Quad
    •  Cores               - Cortex-A9 Quad core
    •  Graphic engine  - Vivante GC 2000 + Vivante GC 355 + Vivante GC 320
    •  Memory           - 2GB DDR3
  • Using Ubuntu 12.04 LTS
    • This version uses softfp, which is not ideal, and hence I am not going to tune HPL
    • The GCC compiler for 12.04 is outdated when it comes to ARM and does not support hardfp
    • Ubuntu 13.10 is available and has the much needed updates. I will move across to this soon, once I have a decent set of results to compare the 12.04 and 13.10 versions. This will be a nice comparison, primarily of the hardfp vs softfp effect.

Prerequisites

HPL requires the availability of a Message Passing Interface (MPI) and either the Basic Linear Algebra Subprograms (BLAS) or the Vector Signal Image Processing Library (VSIPL). In my case I have used MPICH2 and the ATLAS package, both of which I got from the repository. Before you start asking why I have not used an ATLAS-tuned BLAS, and pointing out that my results will be poor because of it, I remind you that my main objective is to have HPL up and running first and foremost. There are too many things that can go wrong in the ATLAS-tuned BLAS approach. I will, however, get to these topics in future posts. I assume you have the standard compilers that come with the Ubuntu 12.04 image. If you are wondering why I used MPICH2 rather than OpenMPI, it's because MPICH2 worked first :)

Get the required packages

sudo apt-get install mpich2
sudo apt-get install libatlas-base-dev

Then get the HPL source code from http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
And extract it to a folder in your home directory. We need to produce the generic make file and then edit this according to our system.

Now to install

tar -xvf hpl-2.1.tar.gz
cd hpl-2.1/setup
sh make_generic
cp Make.UNKNOWN ../Make.wandboard

Now you must link your MPI libraries correctly in order for the build to incorporate multi-core support. It took me a few hours of changing things around until I got it working. This is what I had to change in the end.

ARCH       = wandboard
TOPdir     = $(HOME)/HDD/hpl-2.1
MPdir      = /usr/lib/mpich2
MPinc      = -I$(MPdir)/include
MPlib      = /usr/lib/libfmpich.a
LAdir      = /usr/lib/atlas-base/
LAlib      = $(LAdir)/libf77blas.a $(LAdir)/libatlas.a
HPL_LIBS   = $(HPLlib) $(LAlib) $(MPlib) -lmpl -lcr
CCFLAGS    = $(HPL_DEFS) -mfpu=neon -mfloat-abi=softfp -funsafe-math-optimizations -ffast-math -O3

Just make sure you use the correct TOPdir, and if your libraries are in different locations then change the above accordingly. I added the CCFLAGS as I wanted the best results (knowing that I have standard BLAS libraries). Here is my entire make file if you would like to compare: Make.wandboard-U12.04.

Now compile HPL

make arch=wandboard

HPL has a large number of input variables, and an even larger number of combinations of them, which can be very intimidating. I still have not wrapped my head around all of them. If you go into the HPL.dat file you will see what I mean; you can find it in the bin/wandboard/ folder. You can find a full explanation of what the input variables do here. A very useful site I found gives you a standard HPL.dat file to start from, so let's start by going to the site and filling out the specs you need. Below is the HPL.dat file that I used.

HPLinpack benchmark input file
University of the Witwatersrand
HPL.out      output file name (if any)
8            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
10240         Ns
1            # of NBs
128           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                               Number of additional problem sizes for PTRANS
1200 10000 30000                values of N
0                               number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64        values of NB

Note that you must specify the number of cores that you want to run on. In our case the Wandboard is quad core, hence we specify Ps x Qs = 2 x 2 = 4. If you wanted to run this on a single core then you would set Ps = Qs = 1. If you do not launch with the correct number of cores then you will get an error when running HPL. Note that if you run multiple process grids then you must start HPL with the maximum number of cores that is needed.

Now to start HPL on all four cores I need to run the mpi command. This is done with

mpirun -np 4 ./xhpl

The -np flag determines the number of cores. This must be the same as the product Ps x Qs. The output is then written to the file HPL.out, as set by the 'device out' line in HPL.dat.

Next Up

This was largely successful, as it proves that HPL is working on 4 cores. I quite like the idea of having results before any optimisations, as we can then quantitatively see how much improvement we get from the tuned BLAS and the hardfp we will have in the next set of tests on the Ubuntu 13.10 system.


ARM Cortex-A7 GCC Flags (Allwinner A20)

I have been digging around to try to optimise the GCC flags for the Cubieboard (Allwinner A20 SoC). This post serves as a convenient reference to some of the relevant technical specifications and their associated GCC flags.

Compiler Version

Please take note that the Cortex-A7 architecture is quite new and therefore the version of compiler you choose is very important. The relevant change notes from GCC are as follows:

GCC 4.6 - No official support for the Cortex-A7
GCC 4.7 - Support was added (http://gcc.gnu.org/gcc-4.7/changes.html)
GCC 4.8 - Code generation improvements for Cortex-A7 (http://gcc.gnu.org/gcc-4.8/changes.html)

Therefore, you want to use at least GCC 4.8 if you can. Code built with an older compiler will generally still work, but it will typically have been compiled generically for ARMv7a and tuned for something like the Cortex-A9, which has a VFPv3 FPU.

Allwinner A20 CPU Specifications

  • Dual core Cortex-A7 @ ~1 GHz
  • 256 KB L2 Cache (Shared)
  • 32 KB / 32 KB L1 Cache per core
  • SIMDv2 NEON
  • VFPv4-D16 Floating Point Unit (FPU)
  • Virtualisation Extensions
The ARM web site provides a lot of information regarding the specifications of the Cortex-A7: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0463d/ch02s01s02.html

GCC Flags

A complete list of available GCC flags for ARM can be found here: http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html Each flag has a short (and sometimes longer) description.

The obvious flags are to set the CPU type. The -mcpu flag is quite specific. If you wanted to compile in a more generic manner you could set -march to the ARMv7a architecture and then only use -mtune and not -mcpu which would allow the code to run on all ARMv7a CPU's, but be more optimised for a certain CPU. Generally you should set either -mcpu or -march. I am not interested in this behaviour since it's an embedded system and I want it to be as optimised as possible. -mtune is redundant when -mcpu is set, but one never really knows what's going on inside the compiler so I set both just in case!

-mcpu=cortex-a7
-mtune=cortex-a7

The next most important flags are for the FPU. Software floating point is basically a disaster so don't waste your time with it. We have both VFPv4-D16 and NEON. These are like two sides of the same coin: NEON enables SIMD (Single Instruction Multiple Data) for the VFPv4 FPU, which will provide a large speedup if the compiler can use it. There is a snag, however: NEON is not fully IEEE 754 compliant. What this means is that denormal values are treated as zero, which can lead to some floating-point accuracy issues. I plan on doing some more research into this issue, and into whether it is really worth worrying about.

The flags below are applicable:

-mfpu=vfpv4-d16
-mfpu=neon-vfpv4

Note that if you use NEON then you should also use:

-funsafe-math-optimizations

This is because of the IEEE 754 compliance mentioned earlier. You need to specifically tell the compiler that you don't mind about any accuracy issues - otherwise it essentially won't use the NEON extensions!
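Putting the above together, a typical set of flags for this SoC would be something like:

-mcpu=cortex-a7 -mtune=cortex-a7 -mfpu=neon-vfpv4 -funsafe-math-optimizations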

Setting the Flags in a Cross-Compiler

I am using the Linaro Cross Toolchain which I showed how to set up in a previous post. The issue with this toolchain is that it is tuned for the Cortex-A9 processor. In order to change this permanently, one has to rebuild the toolchain (what a strain!). The Linaro FAQ tells us this.

Of course there is the manual way, where you specify the flags to gcc each time it is called; however, this does not work well with the make command. Instead, I created a bash script and replaced the gcc symlink in the toolchain /bin directory with this script. Don't forget to make it executable!

#!/bin/bash
# Add the Cortex-A7 tuning flags, then forward all original arguments to the real compiler
/path/to/your/cross/gcc -mtune=cortex-a7 -mfpu=neon-vfpv4 "$@"

This will now transparently call the gcc compiler with the flags, every time. If you try to force the -mfloat-abi to hard, you will find that U-Boot fails to compile. This is because U-Boot compiles with software floating point due to their own reasoning.
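To check that the wrapper really is adding the options, you can ask GCC to print its effective target settings through it (a quick sanity check, using the toolchain variable from my earlier setup post):

${CC201310}gcc -Q --help=target | grep -E 'mcpu|mtune|mfpu'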

Tuesday, 12 November 2013

First attempt to integrate Back-end and Front-end for the TileCal upgrade

Today we attempted for the first time to integrate the Back-end and Front-end of the new upgraded TileCal electronics. Work performed in collaboration with IFIC, Stockholm University, University of Chicago, Argonne National Laboratory and CERN. In the picture one can see the daughter board that will sit on the detector, connected with FPGA evaluation boards that serve as simulators of the future sROD.

Linaro Cross-Compile Setup on Ubuntu 12.04

This is a simple guide that shows how to set up an Ubuntu 12.04 based build environment for Linaro.

It is convenient to use a virtual machine for a build environment since you can share it with other team members and also make backups for the entire system by simply duplicating the virtual hard disk.

The first step is to download Ubuntu 12.04 Server 64 bit. If you are not a fan of CLI, you should try to get used to it! :P The Server edition is more light-weight, more optimised for throughput and uses less RAM, letting you get faster builds. If you are installing on a PC or inside a VM the process is pretty much the same – I'll leave it to you! I would recommend at least 2 GB RAM, and 2 CPU’s, though.

I decided to install OpenSSH Server as well as SAMBA file sharing so that I can easily get files onto and off of the machine. I won't explain how to set these up – there are plenty of guides online already!

Install Some Packages

Once the machine is installed there are quite a few packages we need to install. Please type the commands below:

sudo apt-get install build-essential git ccache zlib1g-dev gawk bison flex gettext uuid-dev

Since we are on a 64 bit machine and the Linaro binaries rely on 32 bit as of present, you also need to install 32 bit libs:

sudo apt-get install ia32-libs

Finally, since we will probably be compiling a kernel at some point, we need ncurses for make menuconfig:

sudo apt-get install ncurses-dev

All together this is about 200 MB of downloads. Feel free to combine all the packages into one command and go grab a coffee if you aren't on a fast connection!


Install the Cross-Compiler

The Linaro project is becoming the industry standard for ARM platforms, in my opinion. They regularly produce cross-toolchains (a difficult and tedious task to do oneself), follow the Ubuntu release schedule, and release an ARM-optimised version with some tweaks of their own.

At the moment we are only interested in the toolchain. Please browse to https://releases.linaro.org/ to see what’s available and what the latest reasonable packages are. At the time of writing, 13.11 is the latest but I will be using 13.10 as this has already been compiled by some good people: https://launchpad.net/linaro-toolchain-binaries

It’s a good idea to create a new directory in your home directory (or wherever you want) for any Linaro related stuff. I’ll be using ~/linaro/ as the directory for all Linaro stuff. Download the latest package that has been built and extract it:

cd ~/linaro

wget https://launchpad.net/linaro-toolchain-binaries/trunk/2013.10/+download/gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_linux.tar.xz

tar xJf gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_linux.tar.xz

Remember the handy shortcut of pressing TAB after typing a few letters to auto-complete the rest of the file name!

You will now have extracted a new folder with the cross-compiler toolchain inside it. We will export a variable that will point to this directory so that we can call the GCC, etc. inside easily. I am going to name this variable something useful, besides just CC (cross-compiler) since it is probable that I will be downloading newer versions of the toolchain in future and I would like to be able to access any of them!

export CC201310=`pwd`/gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_linux/bin/arm-linux-gnueabihf-

Naturally you will replace the names with whatever version you are using. The `pwd` part prints the working directory. It’s a shortcut to writing out /home/you/linaro/, which you could do if you wanted to!

Now, whenever you type ${CC201310} followed by a command like gcc, you are actually pointing to that command inside the ${CC201310} directory. To test:

${CC201310}gcc --version

Should give you a result! If you get an error, make sure you installed everything earlier and the export went well.

You should export this every time you boot the machine, or stick it into a script that runs on boot. We can do this by adding a line at the end of ~/.bashrc:

export CC201310=/home/you/linaro/gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_linux/bin/arm-linux-gnueabihf-


That’s it!

Congratulations! Pretty simple, right? Now, if you want to make something using the cross compiler:

make ARCH=arm CROSS_COMPILE=${CC201310}

Obviously this make command isn't going to work if you type it in... Follow the instructions of the software you are trying to install! The command above is just a template. 

I should add that as of writing, the compiler is by default set up for any ARM Cortex with VFPv3 FPU but is tuned for the Cortex-A9. If you want to optimise for the Cortex-A7, etc. then you should specify this to the compiler. See http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html. 

For the Cortex-A7, some example flags are below. Also note that you will have to do some extra reading on how to actually set these flags. The toolchain set up in this post should be adequate to get you going with no further tweaking...

-mcpu=cortex-a7 -mtune=cortex-a7 -mfpu=neon-vfpv4