Friday, 16 May 2014

32 Core - 8 Wandboard Array Rack Mounted

We thought it was time to make our Wandboard Array a little bit more formal. We purchased a small tray for our rack and started work on mounting the array next to its power supply.

Thanks to the Wits Physics workshop, who did a great job of mounting the power supply and brackets.

Final Wandboard Array installed in rack.

Wandboard array fresh from the workshop.


Tuesday, 13 May 2014

Extra Steps for Building a Wandboard (i.MX6Q) Image

I have built several Linaro-based images for the Wandboard or Freescale i.MX6Q SoC, and it's a seemingly simple process of building the kernel, partitioning, and installing the Linaro rootfs to an SD card. One hits a wall when anything 'fancy' needs to be done on this image... this post aims to document how to finish off the image so that it can be used for kernel development and other advanced tasks.

Please note that this is a fairly advanced howto. Most of the concepts here can be found in other places online. The methodology behind the i.MX6 libraries comes from 'reverse engineering' the LTIB install scripts from Freescale, Yocto, and lots of reading. I'm sure there is a better way to do this - but I don't know it (I'm interested to hear of one, though)!

Linux Source Code and Modules
The Linux kernel source code should be cleaned (make ARCH=arm clean) and copied onto the image (or SD card - I'll use the terms interchangeably) into the /usr/src/linux directory.

Before you clean the tree, build the modules and headers and install them to a known path to be copied onto your SD card root filesystem.

make modules_install INSTALL_MOD_PATH=/some/directory/

make headers_install INSTALL_HDR_PATH=/some/directory/

modules_install creates /some/directory/lib/modules/<kernel version>. Copy the contents of /some/directory/lib into the SD card /lib directory. You should now have a new directory: /lib/modules/3.0.35Linux+ or something similar. headers_install creates /some/directory/include, which goes into /usr.
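The copy step can be sketched as a small shell function. This is only a sketch: the function name is made up for illustration, the mount point is whatever your SD card root filesystem happens to be mounted on, and it assumes the usual kernel build layout in which modules land in lib/modules/<version> and headers land in include/ under the install path.

```shell
# Copy staged modules and headers onto a mounted SD card root filesystem.
# $1: staging directory used for INSTALL_MOD_PATH and INSTALL_HDR_PATH
# $2: mount point of the card's root filesystem (hypothetical, e.g. /mnt/sdcard)
stage_to_card() {
    stage="$1"; card="$2"
    mkdir -p "$card/lib/modules" "$card/usr/include"
    # modules_install creates $stage/lib/modules/<kernel version>/
    cp -a "$stage/lib/modules/." "$card/lib/modules/"
    # headers_install creates $stage/include/
    cp -a "$stage/include/." "$card/usr/include/"
}

# Usage (as root): stage_to_card /some/directory /mnt/sdcard
```

cp -a keeps the build and source symlinks as links rather than following them, so they still need the fix described next.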

When you boot with the new SD card, you will need to modify some symlinks (build and source) that reference the wrong place (they will be linking to directories from your build machine, which are obviously not valid any more).

cd /lib/modules/3.0.35Linux+/
ls -l

You should see the build and source links pointing at paths from your build machine. Remove them and re-create them correctly:

rm ./build
rm ./source
ln -s /usr/src/linux ./build
ln -s /usr/src/linux ./source

These commands will have to be run with sudo or as root.
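If you rebuild images often, the same fix can be wrapped in a function. This is a minimal sketch: the function name is hypothetical, and the default of /usr/src/linux simply follows the location used in this post.

```shell
# Re-point the 'build' and 'source' links of one module directory at the
# kernel tree on the target.
# $1: module directory, e.g. /lib/modules/3.0.35Linux+
# $2: kernel source location (defaults to /usr/src/linux)
fix_module_links() {
    moddir="$1"
    ksrc="${2:-/usr/src/linux}"
    [ -d "$moddir" ] || { echo "no such directory: $moddir" >&2; return 1; }
    for link in build source; do
        # -f replaces an existing link; -n stops ln from descending into a
        # link that already points at a directory
        ln -sfn "$ksrc" "$moddir/$link"
    done
}

# Usage (as root, on the Wandboard): fix_module_links "/lib/modules/$(uname -r)"
```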

You may also need to regenerate the kernel build scripts, either because you cleaned the kernel source tree too thoroughly or because you cross-compiled (the kernel scripts directory is then full of x86 binaries such as modpost, which won't run on ARM). Ensure your proper .config file is there and run:

make ARCH=arm oldconfig
make ARCH=arm prepare
make ARCH=arm modules_prepare

I've also found that doing a module build fixed up some errors!

make ARCH=arm modules

Download the i.MX6 Libraries
Download and extract the Freescale BSP somewhere onto your PC (not the Wandboard). There is a source directory with loads of .gz files for various applications. We are interested in several of these. This section will explain how to install the important ones.

A list of the files we will be working with is below. Copy them onto the SD card before booting, or SCP them across. From what I have seen, the 3.0.35-4.1.0 and 3.10.17-1.0.0 releases are basically the same. The 3.10.17 files can be acquired from a partial Yocto installation - Google can probably help with finding the files otherwise!

imx-lib-*.tar.gz
imx-vpu-*.bin
firmware-imx-*.bin
imx-test-*.tar.gz
gpu-viv-bin-mx6q-*hfp.bin (the version from the Freescale BSP seems to be soft-fp... it won't work.)
gpu-viv-g2d-*.bin
fsl-gpu-sdk-*.bin

Some of these are .bin files that ask you to accept a licence agreement before they extract. Go ahead and extract everything in preparation for the installation steps.
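A rough sketch of the extraction step, assuming the .bin files are self-extracting shell archives (which the licence prompt suggests) and that all the downloads sit in one directory. The function name is made up for illustration.

```shell
# Unpack every BSP archive found in a directory (default: current directory).
unpack_bsp() {
    dir="${1:-.}"
    for f in "$dir"/*.tar.gz "$dir"/*.bin; do
        [ -e "$f" ] || continue          # skip glob patterns that matched nothing
        case "$f" in
            *.tar.gz) tar -xzf "$f" -C "$dir" ;;
            # .bin files are run interactively so the licence can be read
            # and accepted by hand
            *.bin)    ( cd "$dir" && sh "./$(basename "$f")" ) ;;
        esac
    done
}

# Usage: unpack_bsp ~/imx6-files
```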

Install the i.MX6 Libraries
First, a few exports to make our lives easier:

export KERNELDIR='/usr/src/linux'

export INCLUDES="-I$KERNELDIR/include -I$KERNELDIR/drivers/mxc/security/rng/include -I$KERNELDIR/drivers/mxc/security/sahara2/include"

As usual, make sure the KERNELDIR variable points to your specific kernel directory. Note that if you have installed the kernel headers, etc. properly in the steps above, you may not need to do these exports and you can leave the INCLUDE= part out of the make commands below.
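Before starting the builds it can save time to check that every include path actually exists. The helper below is a hypothetical sanity check, not part of the Freescale tooling: it prints each -I directory that is missing and prints nothing when all is well. If the output contains a literal $KERNELDIR, the variable was never expanded, which usually means the export was written with single quotes.

```shell
# Print each -I directory from a flag string that does not exist on disk.
missing_includes() {
    for flag in $1; do                 # unquoted on purpose: split on spaces
        case "$flag" in
            -I*) dir="${flag#-I}"
                 [ -d "$dir" ] || echo "$dir" ;;
        esac
    done
}

# Usage: missing_includes "$INCLUDES"
```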

firmware:

Simply copy the contents of the firmware-imx* directory into /lib so that you have new files in /lib/firmware/vpu, etc.

imx-lib:

From the imx-lib directory:

make -j1 PLATFORM="IMX6Q" INCLUDE="$INCLUDES"

sudo make PLATFORM="IMX6Q" install 

If all compiled and copied, you should now see a bunch of new libraries in /usr/lib! Congratulations!

imx-vpu:

Even if you don't want to use the VPU, this is a dependency for the imx-tests. From the imx-vpu directory:

make -j1 PLATFORM="IMX6Q" INCLUDE="$INCLUDES"

sudo make PLATFORM="IMX6Q" install 

If all compiled and copied, you should now see a bunch of new libraries in /usr/lib! Congratulations!

gpu-viv-bin and gpu-viv-g2d:

Copy the contents of the archives into your root. /opt and /usr will now contain new files.

imx-test:

From the imx-test directory:


make -j1 PLATFORM="IMX6Q" test

At this point you can run some of the unit tests that compiled successfully (not all will have) from the ./platform/IMX6Q/autorun*.sh files.

sudo make PLATFORM="IMX6Q" install 

If all compiled and copied, you should now see a bunch of new libraries in /usr/lib! Congratulations!

gpu-viv-bin:

The GPU drivers and binaries are closed source, so it's a matter of extracting the files into the correct place. Search for gpu-viv-*.gz and copy it onto your Wandboard.

Boot, extract it and cd into the new directory. You will see an 'opt' and a 'usr' directory. Run the following commands:

sudo cp -Rv ./opt/* /opt/
sudo cp -Rv ./usr/* /usr/

Make sure the files copied into the correct places.

Tuesday, 6 May 2014

Upgraded Wandboard Cluster

We have invested in a few more Wandboards to upgrade our mini-cluster. Since I repurposed two of the existing boards for my PCI-Express research, we now have 8 Wandboards in the cluster which makes 32 cores! A little more power for scientific applications on ARM...

Besides having more processing power available, I have made some enhancements to the power distribution board. The thin purple, pink and red wires in the background will be connected to one of the Wandboard I2C inputs to enable easy and accurate voltage, current and power measurements of the DC supply to the cluster. The excellent TI INA219 chip is used for this. It's possible to read these values from the Linux command-line so we can use a shell script to log power values directly to a file!
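As a rough idea of what such a script could look like, here is a sketch that reads only the bus voltage using i2cget from i2c-tools. The I2C bus number (1) and the chip address (0x40, the INA219 default when both address pins are grounded) are assumptions about the wiring, and the function names are made up. Current and power readings depend on the shunt resistor and calibration value, so they are left out.

```shell
# Convert the raw word returned by `i2cget -y BUS ADDR 0x02 w` to millivolts.
# i2cget reports SMBus words low byte first, while the INA219 sends the high
# byte first, so the two bytes are swapped before decoding.
ina219_bus_mv() {
    raw=$(( $1 ))
    swapped=$(( ((raw & 0xFF) << 8) | ((raw >> 8) & 0xFF) ))
    # The bus voltage sits in bits 15..3 of register 0x02, with a 4 mV LSB
    echo $(( (swapped >> 3) * 4 ))
}

# Hypothetical logging loop: one "unix-time millivolts" line per second.
log_bus_voltage() {
    while :; do
        printf '%s %s\n' "$(date +%s)" "$(ina219_bus_mv "$(i2cget -y 1 0x40 0x02 w)")"
        sleep 1
    done
}

# Usage (as root): log_bus_voltage >> power.log
```

Keeping the conversion separate from the I2C read means the arithmetic can be checked on any machine, without the hardware attached.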


Friday, 11 April 2014

PCB for ADC board for ATLAS TileCal Prometeo system

Five PCBs for the ADC boards of the ATLAS TileCal Prometeo system have been manufactured in Johannesburg. The Prometeo system is designed for the upgrade. The first version will be used for the certification of the hybrid demonstrator.

Two PCBs have been sent to be populated. We are aiming to send the boards to Europe by the end of the month.



Thursday, 10 April 2014

Announcement of Kruger 2014

WORKSHOP ON DISCOVERY PHYSICS AT THE LHC

KRUGER-2014

December 1 - 6, 2014

Protea Hotel Kruger Gate

Portia Shabangu Road, Skukuza, Mpumalanga, South Africa

Dear Colleagues

We are pleased to announce the Third Biennial "Workshop on Discovery Physics at the LHC" (KRUGER 2014).

The Workshop will be held at the 4-star Protea Hotel Kruger Gate, just 100 meters from the entrance to the Kruger National Park.

Please find details in the conference web page
http://www.kruger2014.tlabs.ac.za

The conference aims to promote scientific exchange of new results and development of novel ideas and models related to the physics of the LHC. The following topics will be covered:

  • Particle Physics
  • Heavy Ion Physics
  • Physics after the discovery of the Brout-Englert-Higgs boson

Accommodation, registration, abstract submission and other practical details can be found on the web page. Attendance will be limited to about 100 participants because of the number of available rooms in the hotel.

Students are encouraged to also take part in a related workshop/school on "Hot and Dense Nuclear & Astrophysical Matter - HDM2014", which will be organized by Professor Azwinndini Muronga (amuronga@uj.ac.za) at the University of Mafeking, November 24 - 28, 2014.

Other related events of interest to students are the "Chris Engelbrecht School in Particle Physics", January 12 - 21, 2015, and the "High Performance Signal and Data Processing", January 26 - 30, 2015.

Limited funding for South African students is available.

We look forward to seeing you all in South Africa

The organizers of KRUGER-2014:

O. Boeriu (Witwatersrand, Johannesburg)
Z. Buthelezi (iThemba LABS)
J. Cleymans (UCT, Cape Town) (Chair)
A. S. Cornell (NITHeP/Witwatersrand, Johannesburg)
S. H. Connell (UJ, Johannesburg)
T. Dietel (UCT, Cape Town)
S. Förtsch (iThemba LABS)
N. Haasbroek (iThemba LABS) (Secretary)
A. Hamilton (UCT, Cape Town)
W. A. Horowitz (UCT, Cape Town)
S. Karataglidis (UJ, Johannesburg)
B. Mellado (Witwatersrand, Johannesburg)
E. Sideras-Haddad (Witwatersrand, Johannesburg)
T. Vickey (Witwatersrand, Johannesburg)
H. Weigert (UCT, Cape Town)
S. Yacoob (UKZN, Durban)

The email address of the conference is kruger2014@tlabs.ac.za




Wednesday, 2 April 2014

Dual Wandboard PCI-Express Connector Complete

The Wandboard PCI-Express adapter is complete! I'm in the process of building a new kernel that supports the PCI-Express RC and Endpoint to start testing the Freescale i.MX6 Quad PCI-Express capabilities.

Here are some photos of the completed board with two Wandboards attached! In the last photo you can see our (now slightly diminished) Wandboard cluster. We are waiting for another five Wandboards to make it 40 cores but no suppliers seem to have stock at the moment...




Friday, 28 March 2014

Wandboard PCI-Express Connector PCB Photos

Today I took delivery of the first (and hopefully last if it's bug-free) version of the dual Wandboard PCI-Express adapter. I plan on having it soldered and ready for testing by early next week. Check out the photos!



Wednesday, 12 March 2014

Wandboard PCI-Express Adapter: Update

It's been a while since I last posted about the Wandboard PCI-Express adapter I have been working on... I decided to redesign the PCB to be more compact. This saves manufacturing costs and it looks better, in my opinion.

The PCB has been sent for manufacture so hopefully in a week or two I can post some photos! Shortly after that - assuming everything goes according to plan - I'll post some results for the PCI-Express performance of the Freescale i.MX6 SoC. I don't think the PCI-Express interface to the Wandboard has been tested by anyone, so hopefully it works...



Tuesday, 11 March 2014

NAS Benchmarks on ARM

The NAS Parallel Benchmarks (link) are a comprehensive suite of benchmarks, maintained by NASA, for testing supercomputers. They were originally based on computational fluid dynamics (in 1994) and expanded over time to cover many different problem types as well as many problem sizes, from very small problems that run in a few seconds for testing purposes to large problems that can take hours on a supercomputer!

Since these benchmarks cover a range of problems, most interestingly a specific Embarrassingly Parallel benchmark, it is important to test their performance on ARM. Luckily the task of building the benchmark suite on ARM is straightforward. I will document it here for those who are interested. I will write about performance tweaks and compiler flags in a later post once I have had more time to experiment.


Installation (Single Processor Test)

  • Download a copy of the source code from the web site linked above. Unzip the source into a directory on your ARM system.
  • You should already have a full suite of compilers (gcc) installed on your system, as well as MPICH or another MPI library.
  • Navigate into the NPB3.3-MPI directory. Please read the README.install text document for some details. There is a short document in each benchmark directory with some details about that specific benchmark.
  • Navigate into the 'config' directory.
  • Run this command to use the template for the build: cp make.def.template make.def
  • Then run this command to use the template for the suites: cp suite.def.template suite.def
  • You now need to customize the make.def file to your system. Your modifications should be the same as mine if you are running Linux (Linaro) on ARM. Scroll through the file and adjust the lines as below:
MPIF77 = mpif77
FFLAGS = -O3
MPICC = mpicc
Un-comment include ../config/make.dummy

  • Note that we uncommented the make.dummy include. This means that true MPI will not be used, and all of the benchmarks will only run on a single processor as a simple test.
  • The template suite.def file is fine for this proof-of-concept.
  • Return to the root directory of NAS with cd ..
  • Type make suite and wait for the build to complete. If something goes wrong there may be an issue with a dependency.

Installation (Multi-Processor MPI)

To install a true MPI version, follow the steps above, except leave the make.dummy include commented out. You should also modify the suite.def file to suit the number of processors (processes) you would like to run.

To run a multi-processor version type:
mpirun -np 4 ./bin/ep.S.4
This runs the 4-processor version of EP with a size of S. The benchmark must have been compiled for the same number of processors, so update the command to match your build.

You can selectively compile a single test at a time. Please see the README.install file - it's really quite simple.
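The multi-processor steps could be scripted along these lines. This is a sketch under two assumptions: that suite.def takes one whitespace-separated "benchmark class processes" entry per line (check the comments in suite.def.template), and that built binaries follow the bin/ep.S.4 naming used above. Both function names are made up for illustration.

```shell
# Emit suite.def lines (benchmark, class, process count) for one benchmark.
suite_lines() {
    bench="$1"; class="$2"; shift 2
    for np in "$@"; do
        printf '%s\t%s\t%s\n' "$bench" "$class" "$np"
    done
}

# Run one built benchmark, e.g. run_npb ep S 4 runs ./bin/ep.S.4 on 4 processes.
run_npb() {
    mpirun -np "$3" "./bin/$1.$2.$3"
}

# Usage, from the NPB3.3-MPI directory:
#   suite_lines ep S 1 2 4 > config/suite.def
#   make suite
#   run_npb ep S 4
```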

Thursday, 6 March 2014

Does the Fused Multiply-Add (FMA) instruction make a difference?

I discussed this originally in my Cortex-A7 FFTW benchmarks, but I am repeating it in its own blog post for clarity as I believe it's an important thing to understand.

I noticed that when enabling the FMA capabilities of FFTW, the performance actually decreased. I thought to myself "but the ARM VFPv4 supports FMA so this should be faster than doing separate multiply and add operations..." so I did a little bit of research as to why this is the case.

In the computation of an FFT, two of the common operations are:

t0 = a + b * c
t1 = a - b * c

The way that the NEON FMA instruction works, however, is not conducive to solving this. This is what happens when you use the NEON FMA instruction:

t0 = a
t0 += b * c
t1 = a
t1 -= b * c

Since ARM is a RISC architecture, the instructions are less flexible and generally take a fixed number of operands. The FMA instruction has no separate addend operand: it multiplies two registers and accumulates the product into the destination register, so the destination must already hold a before the FMA executes. That is why it is used as shown above. Notice that we have to use up two move instructions to initially set t0 and t1. It turns out that in this specific case it's faster to just use multiplies and adds:

t = b * c
t0 = a + t
t1 = a - t

All in all, the FMA version does 2 moves and 2 FMAs (4 instructions). The optimal version does 1 multiply and 2 adds (3 instructions). It's a small difference, one which the compiler may or may not notice and optimise, but when done a significant number of times it adds up, which is what we see in the FFTW benchmarks, for example. There will be cases when this instruction does indeed make a difference, but it's important to bear in mind what's going on behind the scenes.