Wednesday, 13 November 2013

Installing HPL on Wandboard + Ubuntu 12.04

With the ultimate goal of benchmarking an array of Wandboards, a good starting point is to install High Performance LINPACK (HPL) on one machine and then expand the number of boards. This post covers how to configure the required packages and HPL itself to get it up and running. Note: this is not a discussion of tuning or optimizing for high FLOPS, just of getting a working benchmark. I will discuss tuning in a later post.

System Specs

  • Wandboard Quad
    •  Processor         - Freescale i.MX6 Quad
    •  Cores               - Cortex-A9 Quad core
    •  Graphic engine  - Vivante GC 2000 + Vivante GC 355 + Vivante GC 320
    •  Memory           - 2GB DDR3
  • Using Ubuntu 12.04 LTS
    • This version uses softfp, which is not ideal and is why I am not going to tune HPL yet
    • The GCC compiler for 12.04 is outdated when it comes to ARM and does not support hardfp
    • Ubuntu 13.10 is available and has the much-needed updates. I will move across to it once I have a decent set of results, so that I can compare the 12.04 and 13.10 versions. This will mainly be a comparison of hardfp vs softfp.

Prerequisites

HPL requires a Message Passing Interface (MPI) implementation and either the Basic Linear Algebra Subprograms (BLAS) or the Vector Signal Image Processing Library (VSIPL). In my case I used MPICH2 and the ATLAS package, both of which I got from the repository. Before you object that I have not used an ATLAS-tuned BLAS and that my results will be poor because of it, remember that my main objective is to get HPL up and running first and foremost. There are too many things that can go wrong with the ATLAS-tuned BLAS approach, so I will get to that in a future post. I assume you have the standard compilers that come with the Ubuntu 12.04 image. If you are wondering why I used MPICH2 rather than OpenMPI, it is simply that MPICH2 worked first :)

Get the required packages

sudo apt-get install mpich2
sudo apt-get install libatlas-base-dev

Then get the HPL source code from http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
and extract it to a folder in your home directory. We then need to produce the generic make file and edit it to suit our system.

Now to install

tar -xvf hpl-2.1.tar.gz
cd hpl-2.1/setup
sh make_generic
cp Make.UNKNOWN ../Make.wandboard

Now you must link your MPI libraries correctly so that the build picks up multi-core support. It took me a few hours of changing things around until I got it working. These are the lines I ended up changing in Make.wandboard.

ARCH       = wandboard
TOPdir     = $(HOME)/HDD/hpl-2.1
MPdir      = /usr/lib/mpich2
MPinc      = -I$(MPdir)/include
MPlib      = /usr/lib/libfmpich.a
LAdir      = /usr/lib/atlas-base/
LAlib      = $(LAdir)/libf77blas.a $(LAdir)/libatlas.a
HPL_LIBS   = $(HPLlib) $(LAlib) $(MPlib) -lmpl -lcr
CCFLAGS    = $(HPL_DEFS) -mfpu=neon -mfloat-abi=softfp -funsafe-math-optimizations -ffast-math -O3

Just make sure you use the correct TOPdir, and if your libraries are in different locations then change the above accordingly. I added the CCFLAGS because I wanted the best results I could get, knowing that I only have the standard BLAS libraries. Here is my entire make file if you would like to compare: Make.wandboard-U12.04.
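Before building, it can save some time to confirm that the files named in the make file actually exist on your system. The following is only a minimal sketch: check_paths is a made-up helper name, not part of HPL, and the paths in the example call are simply the ones from my settings above, so change them if your libraries live elsewhere.

```shell
# check_paths: hypothetical helper, not part of HPL or MPICH2.
# Prints every argument that does not exist on disk and
# returns non-zero if at least one of them is missing.
check_paths() {
    missing=0
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p"
            missing=1
        fi
    done
    return "$missing"
}

# Example call with the paths from Make.wandboard (adjust to your system):
# check_paths /usr/lib/mpich2/include /usr/lib/libfmpich.a \
#     /usr/lib/atlas-base/libf77blas.a /usr/lib/atlas-base/libatlas.a
```

If the call prints nothing, every path was found; any "missing:" line points at a setting that needs fixing before you run make.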

Now compile HPL from the top-level hpl-2.1 directory

make arch=wandboard

HPL has a large number of input variables and an even larger number of combinations of them, which can be very intimidating. I still have not wrapped my head around all of them. If you open the HPL.dat file, which you can find in the bin/wandboard/ folder, you will see what I mean. You can find a full explanation of what the input variables do here. A very useful site I found gives you a standard HPL.dat file to start from, so let's start by going to the site and filling out the specs you need. Below is the HPL.dat file that I used.

HPLinpack benchmark input file
University of the Witwatersrand
HPL.out      output file name (if any)
8            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
10240        Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                               Number of additional problem sizes for PTRANS
1200 10000 30000                values of N
0                               number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64        values of NB
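As a quick sanity check on the Ns value above: HPL factors a dense N x N matrix of 8-byte doubles, so the matrix alone needs roughly 8 x N^2 bytes in total across all processes. The sketch below just does that arithmetic. The function name is made up for illustration, and the figure ignores HPL's own workspace and everything else running on the board.

```shell
# hpl_matrix_mib: hypothetical helper, not part of HPL.
# Estimates the size in MiB of an N x N matrix of doubles
# (8 bytes per element), using integer shell arithmetic.
hpl_matrix_mib() {
    echo $(( $1 * $1 * 8 / 1024 / 1024 ))
}

hpl_matrix_mib 10240   # prints 800
```

So Ns = 10240 needs about 800 MiB for the matrix, which fits within the 2GB on the Wandboard Quad.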

Note that you must specify the number of processes that you want to run. The Wandboard is a quad core, so we want one process per core and specify P x Q = 2 x 2 = 4. If you wanted to run on a single core you would set Ps = Qs = 1. If you start HPL with fewer processes than P x Q you will get an error. Note that if you list multiple process grids, you must start HPL with enough processes for the largest one.
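Since the process count has to agree with the grid, it can be handy to read it straight out of HPL.dat instead of working it out by hand. This is a rough sketch that assumes the standard file layout shown above, with Ps on line 11 and Qs on line 12, and it only looks at the first grid listed. hpl_procs is a made-up name.

```shell
# hpl_procs: hypothetical helper, not part of HPL.
# Prints P x Q for the first process grid in an HPL.dat file,
# assuming the standard layout (Ps on line 11, Qs on line 12).
hpl_procs() {
    awk 'NR == 11 { p = $1 } NR == 12 { q = $1 } END { print p * q }' "$1"
}

# Example: hpl_procs HPL.dat
```

With the HPL.dat above this prints 4, which is the number of MPI processes to start.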

Now, to start HPL on all four cores I need to launch it through MPI. From the bin/wandboard/ folder this is done with

mpirun -np 4 ./xhpl

The -np option sets the number of processes to start. This should match the product P x Q. Because device out is set to 8 in HPL.dat (neither stdout nor stderr), the output is written to the file HPL.out.
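Once the run finishes, the number of interest is the Gflops column in HPL.out. Here is a small sketch for pulling it out. It assumes the usual HPL result lines, which begin with WR and end with the Gflops value. hpl_gflops is a hypothetical name, and the pattern may need adjusting if your output looks different.

```shell
# hpl_gflops: hypothetical helper, not part of HPL.
# Prints the last field of every result line (lines starting with "WR"),
# which is assumed to be the Gflops figure for that run.
hpl_gflops() {
    awk '/^WR/ { print $NF }' "$1"
}

# Example: hpl_gflops HPL.out
```

If HPL.dat lists several problem sizes or grids, this prints one value per run, in the order they appear in the file.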

Next Up

This was largely successful, as it proves that HPL is working on 4 cores. I quite like the idea of having results before any optimizations, as we can then see quantitatively how much improvement we get from the tuned BLAS and the hardfp ABI in the next set of tests on the Ubuntu 13.10 system.

