Wednesday, 5 February 2014

Nice article on the importance of the SA-CERN program

Nicola Mawson, from ITWeb, has written a nice appraisal of the South Africa-CERN program and its benefits to the country.

http://www.itweb.co.za/index.php?option=com_content&view=article&id=70626:SA-reaps-CERN-rewards

SA's involvement with the Large Hadron Collider (LHC), in Switzerland, is paying dividends as the country embarks on new electronics and physics projects, and benefits from knowledge gained at the European Organisation for Nuclear Research (CERN).
Thanks to collaboration on the project – the £2.6 billion "Big Bang" particle accelerator and the globe's largest experiment – South African universities are developing technology in fast electronics, supercomputing and plastics.
The LHC at CERN led to the discovery, in 2012, of what has become accepted as the elusive Higgs boson. This discovery is anticipated to catapult physics into a new era, as it will be able to probe previously untouched areas, such as dark matter and dark energy.
The Higgs particle – or boson – is named after Peter Higgs, who was one of six authors who theorised about the existence of the particle in the 1960s. It is commonly called the "God Particle", after the title of Nobel physicist Leon Lederman's "The God Particle: If the Universe Is the Answer, What Is the Question?", according to Wikipedia.
Locally, about 70 South Africans are involved in the global project and, while the team is small in comparison to those from other countries, there are substantial benefits coming out of its involvement.
Four universities are participating in the programme: the University of the Witwatersrand (Wits), University of Cape Town (UCT), the University of Johannesburg, and the University of KwaZulu-Natal.
Tom Dietel, a lecturer at UCT, says CERN is a flagship project and is expected to spark interest in science and physics.
Bruce Mellado, an associate professor at Wits' school of physics, says the tertiary institution is involved with projects to develop fast electronics, a new form of plastic, and create a cheap alternative for high-throughput supercomputing.
Mellado explains South Africans have been given the opportunity to tap into CERN's infrastructure at very little cost to the country. He says the country has been "given the benefit of a huge facility without having to pay hundreds of billions for it".
Professor Jean Cleymans, from the UCT's physics department, explains SA's involvement – mostly with the Atlas experiment – is a national project and is not specifically linked to any university. SA is also making contributions to the Alice and Isolde projects, he says. "It's important to have a first step in there."
Cleymans says SA's involvement gives young physicists access to technology, software being developed and knowledge.
Cleymans says "far from being limited to Europe", the project is a worldwide project to contribute to advances at CERN. SA's first step happened in 1992, when it signed its first agreement of interest, he notes.
Mellado explains one of the university's initiatives – named sRod – is a faster electronics board, which will be able to process much more data at faster rates. The board, being developed through collaboration in Europe and SA, is being made in conjunction with SA's Square Kilometre Array (SKA) team.
The multibillion-rand telescope project, hosted by SA, Australia and New Zealand, will collect a staggering amount of data as it probes the universe: the data collected by the SKA in a single day would take nearly two million years to play back on an iPod.
Mellado says the prototype board should be in production locally before winter, which will be a "major milestone for SA". He explains there are increasingly large amounts of data to be analysed, but the price for the hardware is currently a "showstopper".
Being able to get university-made technology commercialised will drop the cost and allow SA's technology industry to develop further, says Mellado. "That's the key to further development."
Wits is also developing a supercomputer under its Mass Affordable Computing project, which Mellado says takes the technology from smartphones and uses it for generic applications such as telecoms and computing. This programme will also be used to aid the SKA's data processing needs. "We can proudly say that we are doing it here."
Such projects are critical to SA's science, says Mellado. "Everything now depends on data processing."
The university is also developing – in collaboration with Sasol – plastic scintillators, with a prototype due in a year or two. Mellado says this material will allow the absorption of light.
The South African consortium launched in 2008 and a few ministerial delegations have visited CERN, says Cleymans. SA's contribution is hosted by iThemba Laboratories, which is a national open laboratory, he adds.
The Department of Science and Technology is funding SA's contribution, says Cleymans, adding that CERN is the first project that has seen South African physics departments team up to collaborate.
CERN is currently temporarily offline in a bid to increase its capacity and explore unknown aspects of physics. In the meantime, South African scientists are helping analyse the data it has collected and are aiding with maintenance.
The experiment will run until 2030 and will be upgraded to 10 times its initial design specification, with the ability to collect 100 times more data.

Monday, 3 February 2014

Wits honoured to be remembered in Madiba's will

Below is a statement from the Vice-Chancellor and Principal of the University of the Witwatersrand:


Dear Colleagues

The University of the Witwatersrand is honoured and deeply appreciative to learn that it is a beneficiary of former president Nelson Mandela’s legacy, and we are indeed humbled that he chose to remember the University in his will.

Wits accepts this generous bequest from one of our most illustrious alumni and commits to using it to address the development of higher education in South Africa, for the benefit of the University and its students, but more importantly to advance and perpetuate the values that our inaugural President has bequeathed to our nation and the world.     

Madiba emphasised the need to address inequality – one of the greatest threats to our young democracy, and Wits is determined to utilise this endowment to tackle this societal peril without delay, through the provision of additional scholarships for our students.  

We understand that this endowment brings with it a tremendous responsibility, given the character and legacy of our great leader and his commitment to the transformative power of education.

Thank you, Tata, for remembering us in your will – you live on in our memory and in our lives. 

Professor Adam Habib
Vice-Chancellor and Principal
University of the Witwatersrand
3 February 2014

Wednesday, 29 January 2014

Release from Wits press office: SA readies for big data storm

Erna van Wyk has written a nice report of the opening day of the workshop:

http://www.wits.ac.za/newsroom/newsitems/201401/22739/news_item_22739.html

Tuesday, 28 January 2014

Photo of Peter Jenni with the Wits High Energy Physics group


Below is a picture of Peter Jenni with members of the Wits High Energy Physics group. Prof. John Carter, the Head of the School of Physics, is also in the photo. Missing from the picture are Oana, Itumeleng, Pablo, Rudolph, Trevor, Kieran and Reto.


The High-performance Signal and Data Processing workshop kicked off on Monday, January 27th. The workshop brings together astronomy, astrophysics and particle physics to address common issues pertaining to the Big Data problem. Attendance has been very good, with over 135 people registered, of whom 60 are students from different parts of the country. Below is the workshop picture:




The hands-on sessions with FPGA-based electronics start tomorrow.


Monday, 20 January 2014

Herwig Schopper, former CERN Director General, addresses the LHeC workshop. Impressive!

Herwig chairs the LHeC International Advisory Committee.





The LHeC workshop has started. This is the fourth edition and the best attended so far. In the photo one can see Sergio Bertolucci, the Director of Research and Scientific Computing at CERN, addressing the workshop.



Tuesday, 14 January 2014

LHeC workshop to take place January 20-21st in Chavannes-de-Bogis, Switzerland





The 2014 LHeC workshop follows the publication of its conceptual design report (CDR) and the discovery of the Higgs boson in 2012. Recent LHeC progress concerns the study of the increase of its luminosity to 10^34 cm^-2 s^-1, the choice of the frequency of 802 MHz as well as the design towards an energy recovery test facility with up to 1 GeV of electron beam energy. The workshop will discuss progress on the LHeC physics (Higgs, top, heavy ions etc.) in conjunction with the evolution and simulation of the detector design. It will consider maximising the ep luminosity in synchronous ep and pp operation at the LHC, including a heavy ion programme, and progress in the design of the ERL test facility at CERN. Very recent developments towards a future multi-TeV proton accelerator at CERN open prospects for ep/eA experimentation at yet enlarged energy. The workshop is open for contributions on the various aspects of the LHeC development and its relation to the LHC. It is prepared by an organising committee in collaboration with the working group convenors and an international advisory committee.

http://indico.cern.ch/conferenceDisplay.py?confId=278903


Thursday, 19 December 2013

FFTW Benchmarks on Cortex-A7

The FFT algorithm has many scientific uses. The most obvious are in radio astronomy, for the frequency analysis of signals. It is also vital to Software Defined Radio (SDR), which is used extensively in the Square Kilometre Array (SKA). In line with the goals of the MAC Project, I am curious about how well an ARM processor (specifically the Cortex-A7) can perform FFTs - which leads to these benchmarks.

I found some existing FFTW benchmarks for the Cortex-A8 and A9 published by Vesperix. I used their modified FFTW 3.2.2 for ARM NEON and also ran benchmarks using the latest official version of FFTW: 3.3.3. Both sets of results are presented below with a short discussion afterwards.

I was unable to get the FFTW 3.3.3 NEON version working. I was repeatedly hit by a segmentation fault, which I think is due to different memory alignment in the newer NEON and VFPv4 FPUs. I will post these specific benchmarks when the error is resolved.

System Specifications

The tests were run on a Cubieboard2 with the following specifications:
  • Allwinner A20 Dual-Core Cortex-A7 SoC @ ~1GHz
  • VFPv4 and NEONv2 FPU
  • 256 kB L2 Cache
  • 1 GB DDR3 RAM
  • 8 GB Class 10 MicroSD Card
  • sunxi kernel 3.4.67+
  • Linaro 13.04 (with GCC 4.7.3)

Benchmark Methodology

I am only presenting the results for complex 1D FFTs with power-of-two and non-power-of-two sizes. These are the types of FFTs that are most useful to radio astronomy, since signal phase and amplitude are represented as a complex number. I ran several sets of benchmarks with various optimisations for comparison, each of which I will describe below.
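To make the point about complex input concrete, here is a minimal sketch in plain Python (standard library only; it does not use FFTW). The function `naive_dft` and the test tone are hypothetical and exist only for this illustration. It builds one complex tone with a known amplitude and phase, transforms it with a direct O(N^2) DFT, and reads the amplitude and phase back from the peak bin.

```python
import cmath
import math

def naive_dft(samples):
    """Direct O(N^2) DFT of a list of complex samples (illustration only)."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        for k in range(n)
    ]

# Hypothetical test signal: a single complex tone in bin 3,
# with amplitude 2.0 and phase 0.5 rad.
N = 16
AMPLITUDE, PHASE, BIN = 2.0, 0.5, 3
tone = [
    AMPLITUDE * cmath.exp(1j * (2 * math.pi * BIN * i / N + PHASE))
    for i in range(N)
]

spectrum = naive_dft(tone)

# The strongest bin carries both quantities in one complex value.
peak_bin = max(range(N), key=lambda k: abs(spectrum[k]))
peak_amplitude, peak_phase = cmath.polar(spectrum[peak_bin] / N)
print(peak_bin, round(peak_amplitude, 6), round(peak_phase, 6))
```

The peak bin should come back as 3, with the amplitude and phase of the original tone. A real-valued transform would need extra bookkeeping to carry the same phase information.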

I first tested the Vesperix FFTW 3.2.2 and then the FFTW 3.3.3. In all cases I used the following configure flags for single precision and the only available timer on the ARM processor:

--enable-single --with-slow-timer

I ran the non-SIMD (no NEON) tests without any extra flags, and the NEON SIMD tests with the flag below:

--enable-neon

I also tried out the fused multiply-add flag, since the Cortex-A7 has this instruction in the VFPv4 FPU, but I found that this flag actually caused performance to decrease! A short explanation can be found in the FFTW 3.3.3 section.

--enable-fma

In all cases I modified the configure script to optimise for the CPU with the '-mcpu=cortex-a7' flag. I also modified the configure script to try out different GCC FPU options where appropriate, but in general I am only presenting the fastest results in this post. The options I tried are listed below for reference:

-mfpu=neon
-mfpu=neon-vfpv4
-mfpu=vfpv4-d16
-mfpu=vfpv3-d16

I repeated the tests with NEON on a threaded version of FFTW to see at which point multiple threads (on multiple cores) make a difference, and by how much. To enable the threaded version, FFTW must be recompiled with the flag below. Note that FFTW can be compiled once with this flag and then used in either a threaded or an unthreaded way.

--enable-threads

I plan on running an MPI version with more threads (4 to 16) on our Cubieboard and Wandboard clusters at a later stage.

I used the script provided by Vesperix to automate the benchmarks. For the threaded tests I modified the script to pass the '-onthreads=2' option to the bench program. The number can be adjusted to suit the number of cores available on the system. The modified script is shown below.

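#!/bin/sh
# Benchmark complex ('c') 1-D FFTs, in-place ('i') and out-of-place ('o'),
# first with two threads and then with a single thread.
# $OPTS is expected to come from the environment; it is empty if unset.
for TYPE in 'c'; do
  # Threaded runs: -onthreads=2 asks bench to use two threads.
  for PLACE in 'i' 'o'; do
    echo "$TYPE $PLACE 1-D powers of two (2 threads)"
    for SIZE in '2' '4' '8' '16' '32' '64' '128' '256' '512' '1024' '2048' '4096' \
                '8192' '16384' '32768' '65536' '131072' '262144' '524288' '1048576' '2097152'; do
      ./bench -onthreads=2 $OPTS ${PLACE}${TYPE}${SIZE}
    done
  done
  # Single-threaded runs over the same sizes, for comparison.
  for PLACE in 'i' 'o'; do
    echo "$TYPE $PLACE 1-D powers of two (1 thread)"
    for SIZE in '2' '4' '8' '16' '32' '64' '128' '256' '512' '1024' '2048' '4096' \
                '8192' '16384' '32768' '65536' '131072' '262144' '524288' '1048576' '2097152'; do
      ./bench $OPTS ${PLACE}${TYPE}${SIZE}
    done
  done
done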

MFLOPS Result Interpretation:

The result reported by FFTW is labelled 'MFLOPS', but this is not a true MFLOPS figure. It is estimated by FFTW from the assumed algorithmic complexity of the standard Cooley-Tukey FFT algorithm. For a complex transform of size N that takes t microseconds:

MFLOPS = 5 N log2(N) / t

Although this is not necessarily accurate in the classic FLOPS sense, it is calculated the same way in all cases, so it works as a way to compare runs. For comparison with other algorithms, I would rather use the actual time the algorithm takes to run on a specific FFT size (N).
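The conversion between the reported figure and wall-clock time can be sketched as follows. This assumes the usual FFTW benchmark convention for complex transforms, 5 N log2(N) divided by the time for one transform in microseconds. The function names are hypothetical, and the 500 'MFLOPS' value is an illustrative number, not one of the measured results.

```python
import math

def fftw_mflops(n, seconds):
    """Nominal 'MFLOPS' for a complex 1-D FFT of size n taking `seconds` per transform."""
    return 5.0 * n * math.log2(n) / (seconds * 1e6)

def seconds_per_fft(n, mflops):
    """Invert the convention: time per transform implied by a reported figure."""
    return 5.0 * n * math.log2(n) / (mflops * 1e6)

# Illustrative input only: what a reported 500 'MFLOPS' at N = 1024 would mean.
n = 1024
t = seconds_per_fft(n, 500.0)
print(f"N={n}: {t * 1e6:.1f} us per transform")
print(f"round trip: {fftw_mflops(n, t):.1f} MFLOPS")
```

Converting back to time per transform is what makes the numbers comparable with an algorithm that has a different operation count.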

FFTW 3.2.2 (Vesperix):

Please examine the various graphs below. Clearly, NEON makes quite a large difference and is a 'no-brainer' for any application. I am showing one set of non-power-of-two benchmarks to illustrate why those sizes should not be used.



I ran a test of the threaded version of FFTW 3.2.2 and the results are promising for a scaled-up system. The Cubieboard2 is only a dual-core system but I plan on running MPI tests with more cores at a future date.



FFTW 3.3.3 (Official):

I was unable to get the NEON version of FFTW 3.3.3 working. I was able to run benchmarks of the scalar version of the code, which show a performance improvement over the 3.2.2 scalar results. I compiled one graph comparing all the different scalar versions, with and without FMA instructions.



Note how the FMA versions have slightly lower performance. In the Benchmark Methodology section I mentioned that the --enable-fma flag actually causes performance to decrease. The reason is not intuitive: one would expect a fused multiply-add (FMA) instruction to save cycles, since it replaces separate multiply and add instructions. In the computation of an FFT, two of the common operations are:

t0 = a + b * c
t1 = a - b * c

The way that the NEON FMA instruction works, however, is not conducive to computing these. This is what happens when you use the NEON FMA:

t0 = a
t0 += b * c
t1 = a
t1 -= b * c

Notice that we have to use up two move instructions for initially setting t0 and t1. It turns out that in this specific case it's faster to just use Multiplies and Adds:

t = b * c
t0 = a + t
t1 = a - t

All in all, the FMA version does 2 moves and 2 FMAs, while the optimal version does 1 multiply and 2 adds. It's a small difference, one which the compiler may or may not notice and optimise, but repeated a significant number of times it adds up.
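The two instruction sequences can be written out side by side as a small sketch. The function names and the operation tallies are hypothetical bookkeeping that mirrors the counts in the text. They are not measured cycle counts, and Python does not actually issue fused instructions, so this only shows that both forms compute the same pair of results with a different number of operations.

```python
from collections import Counter

def butterfly_fma(a, b, c):
    """Accumulate-style form: each result needs a move, then one fused operation."""
    ops = Counter()
    t0 = a
    ops["move"] += 1
    t0 += b * c          # stands in for one fused multiply-accumulate
    ops["fma"] += 1
    t1 = a
    ops["move"] += 1
    t1 -= b * c          # stands in for one fused multiply-subtract
    ops["fma"] += 1
    return t0, t1, ops

def butterfly_mul_add(a, b, c):
    """Shared-product form: one multiply, reused by an add and a subtract."""
    ops = Counter()
    t = b * c
    ops["mul"] += 1
    t0 = a + t
    ops["add"] += 1
    t1 = a - t
    ops["add"] += 1
    return t0, t1, ops

# Inputs chosen to be exactly representable, so both forms agree bit for bit.
fma_result = butterfly_fma(1.5, 2.0, 0.25)
plain_result = butterfly_mul_add(1.5, 2.0, 0.25)
print(dict(fma_result[2]), sum(fma_result[2].values()))
print(dict(plain_result[2]), sum(plain_result[2].values()))
```

The shared-product form wins here because both results reuse the same product b * c, which the accumulate-style instruction has no way to share.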

Conclusion

The results from this set of benchmarks are very similar to those obtained by Vesperix on Cortex-A9 boards. The multi-threaded version is also significantly better for larger FFT sizes. The results at different FFT sizes are very dependent on processor implementation details such as cache sizes and memory access times. With smaller FFTs, the overhead associated with calculating the FFT is a large factor, and this is clearly visible up to sizes of 128.

The scalar results for FFTW 3.3.3 are better than those from 3.2.2 so it is logical to assume that the newer version's NEON performance will be better as well.

Since FFTW works by creating a 'plan' before actually calculating the FFT, it chooses not to use more than one thread in the multi-threaded version below a certain FFT size. This is clearly visible, as it only starts multi-threading at sizes greater than 128. The overhead associated with doing this causes the result to be poor at size 256 and, based on the results, multi-threading should only be enabled for sizes over 1024.

The power usage of the Cortex-A7 processor is lower than that of the Cortex-A9. If a large cluster of these devices is used for a computational task such as radio astronomy, one could therefore speculate that it may be worthwhile to use more Cortex-A7s rather than fewer Cortex-A9s, since the performance is similar.


Friday, 6 December 2013

Current Measurement Board

Since we are interested in power measurements for the different ARM platforms, I decided to quickly design and build a simple current measurement board. An oscilloscope connected to it can plot current, and hence power, given a corresponding voltage measurement on the second channel.

The concept is based on Ohm's law: the voltage across a resistor is equal to the resistance multiplied by the current through it. The board design in the schematic uses a known 0.01 Ohm resistor with a 1% tolerance, a gain of 100 and a low-pass filter at roughly 1.5 kHz. The gain of 100 results in an output voltage that maps directly onto the current: 1 A through the resistor drops 10 mV, which is amplified to 1 V. This is so that we can use the oscilloscope's built-in multiplication to see power in real time. The low-pass filter is there so that we don't get too much noise on the current measurement, but still have enough response to see a spike when we start benchmarks.
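As a rough sketch of the arithmetic the oscilloscope performs, the conversion from the board's output voltage to current and power looks like this. The resistor value, gain and tolerance come from the description above. The function names and the example readings (0.8 V output, 5 V supply) are hypothetical.

```python
SHUNT_OHMS = 0.01        # sense resistor value from the board description
GAIN = 100.0             # amplifier gain from the board description
SHUNT_TOLERANCE = 0.01   # 1% resistor tolerance

def current_from_output(v_out):
    """Current in amps implied by the board's output voltage in volts."""
    return v_out / (GAIN * SHUNT_OHMS)

def power_from_readings(v_out, v_supply):
    """Power in watts from the two oscilloscope channels."""
    return current_from_output(v_out) * v_supply

def current_bounds(v_out):
    """Range of true current if the resistor is at either edge of its tolerance."""
    nominal = current_from_output(v_out)
    return nominal / (1 + SHUNT_TOLERANCE), nominal / (1 - SHUNT_TOLERANCE)

# Hypothetical readings: 0.8 V on the current channel, 5 V on the supply channel.
amps = current_from_output(0.8)
watts = power_from_readings(0.8, 5.0)
low, high = current_bounds(0.8)
print(f"{amps:.3f} A, {watts:.2f} W, true current between {low:.3f} A and {high:.3f} A")
```

Because the gain times the resistance is 1 V per amp, the current channel can be read directly in amps and multiplied by the supply channel without any extra scaling.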

I have posted images of the schematic and photos of the finished board. If you would like the Cadsoft Eagle design files, I'm happy to share them - just put a request in the comments below. I should also mention that the op-amp is unnecessarily high end! To be honest, it is a free sample that I gratefully received from Maxim, so I used it...

Specifications:
Input Current: ~50 mA - 5 A
Output Voltage: ~50 mV - 5 V (dependent on how close to the negative / ground rail the op-amp can go)
Frequency Response: -3 dB @ 1591 Hz
Supply Voltage: 5 V (dependent on the op-amp)
Power Loss in 0.01 Ohm Resistor @ 5 A: 0.25 W (one can use a spreadsheet to compensate for this error)
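The figures in the specification can be cross-checked with a few lines of arithmetic. This sketch assumes a first-order RC low-pass filter. The component values `R_FILTER` and `C_FILTER` are hypothetical, chosen only because they reproduce a 1591 Hz corner, and the actual values are in the schematic rather than in the text.

```python
import math

SHUNT_OHMS = 0.01   # sense resistor value from the specification
I_MAX = 5.0         # upper end of the input current range, in amps

# Voltage dropped across the resistor and power dissipated in it at full current.
shunt_drop = I_MAX * SHUNT_OHMS
shunt_power = I_MAX ** 2 * SHUNT_OHMS

def rc_cutoff_hz(r_ohms, c_farads):
    """-3 dB frequency of a first-order RC low-pass filter."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)

# Hypothetical filter components (assumption, not taken from the schematic).
R_FILTER = 10e3     # 10 kOhm
C_FILTER = 10e-9    # 10 nF

cutoff = rc_cutoff_hz(R_FILTER, C_FILTER)
print(f"drop at {I_MAX} A: {shunt_drop * 1e3:.0f} mV")
print(f"power in resistor: {shunt_power:.2f} W")
print(f"filter corner: {cutoff:.0f} Hz")
```

The 50 mV drop at 5 A is the measurement error the spreadsheet compensation refers to: the device under test sees slightly less than the supply voltage.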