System Specs
- Wandboard Quad
- Processor - Freescale i.MX6 Quad
- Cores - Cortex-A9 Quad core
- Graphic engine - Vivante GC 2000 + Vivante GC 355 + Vivante GC 320
- Memory - 2GB DDR3
- Using Ubuntu 12.04 LTS
- This version uses softfp, which is not ideal, so I am not going to tune HPL on it
- The GCC compiler for 12.04 is outdated when it comes to ARM and does not support hardfp
- Ubuntu 13.10 is available and has the much-needed updates. I will move across to it once I have a decent set of results, so that I can compare the 12.04 and 13.10 versions. This will mainly be a comparison of the hardfp vs softfp effect.
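If you are not sure which floating-point ABI your own image uses, the Debian/Ubuntu architecture name is a quick hint: armhf images use the hard-float calling convention, while armel images use the soft-float one (which softfp code is compatible with). Below is a minimal sketch; float_abi_for_arch is a helper name I made up for illustration, not a standard tool.

```shell
#!/bin/sh
# Sketch: map a Debian/Ubuntu architecture name to its float ABI.
# float_abi_for_arch is a hypothetical helper, not part of any package.
float_abi_for_arch() {
    case "$1" in
        armhf) echo "hard" ;;            # hard-float calling convention
        armel) echo "soft or softfp" ;;  # soft-float calling convention
        *)     echo "unknown" ;;
    esac
}

# Report the ABI of the running system when dpkg is available.
if command -v dpkg >/dev/null 2>&1; then
    arch=$(dpkg --print-architecture)
    echo "$arch: $(float_abi_for_arch "$arch")"
fi
```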
Prerequisites
HPL requires a Message Passing Interface (MPI) implementation and either the Basic Linear Algebra Subprograms (BLAS) or the Vector Signal Image Processing Library (VSIPL). In my case I used MPICH2 and the ATLAS package, both of which I got from the repository. Before you object that I have not used an ATLAS-tuned BLAS and that my results will be poor because of it, remember that my main objective is to get HPL up and running first. There are too many things that can go wrong with the ATLAS-tuned BLAS approach, and I will get to those topics in future posts. I assume you have the standard compilers that come with the Ubuntu 12.04 image. If you are wondering why I used MPICH2 rather than OpenMPI, it is simply that MPICH2 worked first :)
Get the required packages
sudo apt-get install mpich2
sudo apt-get install libatlas-base-dev
Then get the HPL source code from http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz and extract it to a folder in your home directory. We need to produce the generic make file and then edit it to match our system.
Now to install
tar -xvf hpl-2.1.tar.gz
cd hpl-2.1/setup
sh make_generic
cp Make.UNKNOWN ../Make.wandboard
Now you must link your MPI libraries correctly so that the build incorporates multi-core support. It took me a few hours of changing things around until I got it working. This is what I had to change in Make.wandboard in the end.
ARCH = wandboard
TOPdir = $(HOME)/HDD/hpl-2.1
MPdir = /usr/lib/mpich2
MPinc = -I$(MPdir)/include
MPlib = /usr/lib/libfmpich.a
LAdir = /usr/lib/atlas-base/
LAlib = $(LAdir)/libf77blas.a $(LAdir)/libatlas.a
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib) -lmpl -lcr
CCFLAGS = $(HPL_DEFS) -mfpu=neon -mfloat-abi=softfp -funsafe-math-optimizations -ffast-math -O3
Make sure you use the correct TOPdir, and if your libraries are in different locations, change the paths above accordingly. I added the CCFLAGS because I wanted the best results I could get while still using the standard BLAS libraries. My entire make file is available as Make.wandboard-U12.04 if you would like to compare.
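Since wrong library paths cost me the most time, it can help to confirm the files exist before building. The sketch below checks the static libraries named in the settings above; check_libs is a helper name I made up, and the paths are simply the ones from my make file, so adjust them to your system.

```shell
#!/bin/sh
# Sketch: confirm that each library file passed as an argument exists.
# check_libs is a hypothetical helper; it returns 1 if anything is missing.
check_libs() {
    missing=0
    for lib in "$@"; do
        if [ -f "$lib" ]; then
            echo "found:   $lib"
        else
            echo "missing: $lib"
            missing=1
        fi
    done
    return "$missing"
}

# Paths taken from the make file settings in this post; yours may differ.
check_libs /usr/lib/libfmpich.a \
           /usr/lib/atlas-base/libf77blas.a \
           /usr/lib/atlas-base/libatlas.a \
    || echo "fix the paths in Make.wandboard before building"
```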
Now compile HPL
make arch=wandboard
HPL has a large number of input variables and an even larger number of combinations of them, which can be very intimidating. I still have not wrapped my head around all of them. If you open the HPL.dat file you will see what I mean. You can find it in the bin/wandboard/ folder. You can find a full explanation of what the input variables do here. A very useful site I found gives you a standard HPL.dat file to start from, so let's start by going to the site and filling in the specs you need. Below is the HPL.dat file that I used.
HPLinpack benchmark input file
University of the Witwatersrand
HPL.out      output file name (if any)
8            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
10240        Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0            Number of additional problem sizes for PTRANS
1200 10000 30000 values of N
0            number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64 values of NB
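As a rough guide to picking Ns, a common rule of thumb is to let the N x N matrix of 8-byte doubles fill a fixed fraction of RAM and then round N down to a multiple of NB. The sketch below assumes a fraction of 0.8, which is a convention and not something HPL enforces; hpl_suggest_n is a helper name I made up. For comparison, the N = 10240 used here needs 10240 x 10240 x 8 bytes = 800 MiB, well below that limit on a 2GB board.

```shell
#!/bin/sh
# Sketch: suggest a problem size N for HPL.
#   $1 = memory in bytes, $2 = block size NB,
#   $3 = fraction of memory to use (assumed default: 0.8)
# hpl_suggest_n is a hypothetical helper, not part of HPL.
hpl_suggest_n() {
    awk -v m="$1" -v nb="$2" -v f="${3:-0.8}" 'BEGIN {
        n = int(sqrt(m * f / 8))   # 8 bytes per double-precision entry
        print int(n / nb) * nb     # round down to a multiple of NB
    }'
}

# Example: 2GB of RAM with NB = 128
hpl_suggest_n 2147483648 128
```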
Note that you must specify the number of cores that you want to run on. The Wandboard is a quad core, so we specify Ps X Qs = 2 X 2 = 4. If you wanted to run on a single core you would set Ps = Qs = 1. If the number of processes you start does not match the process grid, you will get an error when running HPL. Note also that if you run multiple process grids, you must start HPL with the maximum number of cores needed by any of them.
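To avoid a mismatch, the process count can be read straight out of HPL.dat. This is a minimal sketch that assumes the standard HPL.dat layout with Ps on line 11 and Qs on line 12, and a single process grid; hpl_nprocs is a helper name I made up.

```shell
#!/bin/sh
# Sketch: compute Ps x Qs from an HPL.dat in the standard layout
# (Ps on line 11, Qs on line 12, one process grid only).
# hpl_nprocs is a hypothetical helper, not part of HPL.
hpl_nprocs() {
    awk 'NR == 11 { p = $1 } NR == 12 { q = $1 } END { print p * q }' "${1:-HPL.dat}"
}

# Example use from bin/wandboard/ (only runs if HPL.dat is present).
if [ -f HPL.dat ]; then
    echo "launch with: mpirun -np $(hpl_nprocs HPL.dat) ./xhpl"
fi
```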
Now to start HPL on all four cores I need to run the mpi command. This is done with
mpirun -np 4 ./xhpl
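The -np option sets the number of MPI processes, one per core in our case. This must be the same as the product Ps X Qs. Run the command from the bin/wandboard/ folder so that xhpl can find HPL.dat. Because "device out" is set to 8 (a file) in HPL.dat, HPL writes its output to the file HPL.out rather than to the terminal.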
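Once the run finishes, the numbers of interest can be pulled out of HPL.out. HPL prints one result line per run with seven columns (T/V, N, NB, P, Q, Time, Gflops), and the T/V code starts with WR or WC. The sketch below relies on that layout; hpl_results is a helper name I made up.

```shell
#!/bin/sh
# Sketch: print the result lines from an HPL output file.
# Assumes the usual 7-column layout: T/V N NB P Q Time Gflops.
# hpl_results is a hypothetical helper, not part of HPL.
hpl_results() {
    awk '$1 ~ /^W[RC]/ && NF == 7 {
        print "N=" $2, "NB=" $3, "P=" $4, "Q=" $5, "time_s=" $6, "gflops=" $7
    }' "${1:-HPL.out}"
}

# Example use from bin/wandboard/ (only runs if HPL.out is present).
if [ -f HPL.out ]; then
    hpl_results HPL.out
fi
```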
Next Up
This was largely successful, as it proves that HPL is working on 4 cores. I quite like the idea of having results before any optimizations, because we will be able to see quantitatively how much improvement we get from the tuned BLAS and the new hardfp in the next set of tests on the Ubuntu 13.10 system.