Thursday, 13 November 2014

Press release: LAUNCH OF THE NEW PHYSICS HIGH-THROUGHPUT ELECTRONICS LABORATORY

MEDIA INVITATION FOR PR NEWSWIRE FROM CLIENT WITS UNIVERSITY

ATTENTION: NEWS EDITORS AND SCIENCE REPORTERS


The Wits School of Physics invites all media to the launch of the High-Throughput Electronics Laboratory (HTEL) this Friday, 14 November 2014.
This new state-of-the-art lab and its facilities at Wits will be a platform for research and development of high-throughput electronics for the ATLAS detector at the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN). The laboratory is designed to tackle the Big Data problem of processing the large amounts of data needed to produce new discoveries, following the observation of the Higgs boson at the LHC.
This work could lead to large-scale production of electronic devices by South African industry, based on designs developed at the High-Throughput Electronics Laboratory (HTEL).
Date: Friday, 14 November 2014
Time:  09:30 for 10:00
Venue:  P213 (Honours Presentation Room), School of Physics, Physics Building, Braamfontein Campus East
RSVP: Christina Thinane on 011 717 6848 or Christina.Thinane@wits.ac.za
All media are invited.

About the HTEL:
High-throughput electronics deals with the transfer of huge volumes of data at very high rates in challenging environments, such as those with high levels of radiation, possible single-event upsets and other factors that may corrupt data. To “read” these data, very fast decisions need to be made in order to select and process the large amounts of data at high rates.
This laboratory will first of all serve the needs of the upgrade of the ATLAS detector and, more specifically, of its Tile Calorimeter. This ATLAS sub-detector shares strong commonalities with other systems in the way data is transferred and in how the off-detector electronics are designed.
But ATLAS is not the only beneficiary. South Africa’s flagship big-science project, the SKA, faces the same technological challenges of high-throughput data flows with fast processing. Prototypes of fast electronics and computing developed for the ATLAS detector could also be used by the SKA.
A spin-off of the design and prototyping work being done for the ATLAS project is the Massive Affordable Computing (MAC) project. One of the limiting factors in harnessing large computing capabilities is cost: high-performance computers are not cost-effective and need to be imported into the country. The idea behind this project is to develop prototypes for high-performance computing from cost-effective components, for a very wide range of applications in research and industry.
A first spin-off of the HTEL is the development of a mini-PC catering to the needs of South Africa’s educational system. A few prototypes are currently available and are being tested at the HTEL. These incorporate power-efficient and low-cost technologies. A number of these prototypes will be deployed to schools and universities in January-February for feedback. These mini-PCs could be manufactured in South Africa in quantities large enough to meet the needs of the country’s educational system.
Ends
Issued by:
Erna van Wyk
Multimedia Communications Officer | Wits Communications 
University of the Witwatersrand
Contacts: +27 11 717 4023 

Wednesday, 5 November 2014

New NVIDIA Jetson-TK1 Cluster



We recently finished setting up a new cluster, and this time we wanted maximum processing power compared to our existing Wandboard cluster! We opted for the new NVIDIA Jetson-TK1 development boards, which host the NVIDIA Tegra K1 System on Chip...

The Tegra K1 is a beast: it has a quad-core ARM Cortex-A15 CPU which runs at up to 2.3 GHz, plus a fifth low-power core, also a Cortex-A15, whose clock is limited to a few hundred MHz. There is also a 192-core CUDA (Kepler) GPU on the SoC which, according to the spec sheets, can attain about 350 GFLOPS. Preliminary benchmarks of the CPU indicate that High Performance Linpack scores over 20 GFLOPS in single precision!

We have built a cluster with 11 boards, which equates to 44 ARM Cortex-A15 cores (~220 GFLOPS), 22 GB RAM, Gigabit Ethernet and around 3850 GFLOPS worth of GPGPU processing power! The entire cluster should consume less than 200 W of electricity under load.

We have started running benchmarks on the cluster and will report the results soon...

Friday, 31 October 2014

Adventures with ARM GCC Auto-Vectorization

I have been experimenting with how well gcc (4.8.2) compiles a simple dot product of two vectors with 8 elements each. I have discovered some interesting things (whether they are bugs or quirks I don't know) and I wanted to share my findings...

The first thing to know is how to generate assembly with gcc. Simply add -S to the gcc or g++ command. I am using Eclipse CDT and I have created several build configurations (some of which generate assembly for inspection).

I have been experimenting with the Eigen library since it apparently supports ARM NEON. What I found is that it works extremely well with x86 SIMD instructions but not so well with ARM NEON. It seems that well-coded C++ is better for NEON than Eigen...

x86 SSE2 with Eigen

movaps      32(%rdi), %xmm0
movaps      16(%rdi), %xmm1
mulps       32(%rsi), %xmm0
mulps       16(%rsi), %xmm1
addps       %xmm0, %xmm1
movaps      %xmm1, %xmm0
movhlps     %xmm1, %xmm0
addps       %xmm1, %xmm0
movaps      %xmm0, %xmm1
shufps      $1, %xmm0, %xmm1
addss       %xmm1, %xmm0

x86 SSE2 without Eigen

movq      (%rdi), %rdx
movq      (%rsi), %rax
movss     (%rdx), %xmm1
mulss     (%rax), %xmm1
movss     4(%rdx), %xmm0
mulss     4(%rax), %xmm0
addss     .LC0(%rip), %xmm1
addss     %xmm0, %xmm1
movss     8(%rdx), %xmm0
mulss     8(%rax), %xmm0
addss     %xmm0, %xmm1
movss     12(%rdx), %xmm0
mulss     12(%rax), %xmm0
addss     %xmm0, %xmm1
movss     16(%rdx), %xmm0
mulss     16(%rax), %xmm0
addss     %xmm0, %xmm1
movss     20(%rdx), %xmm0
mulss     20(%rax), %xmm0
addss     %xmm0, %xmm1

ARM NEON-VFPv4 with Eigen

flds        s0, .L2
ldr         r3, [r1]
ldr         r2, [r0]
flds        s15, [r3]
flds        s14, [r2]
vfma.f32    s0, s14, s15
flds        s6, [r2, #4]
flds        s7, [r3, #4]
flds        s8, [r2, #8]
flds        s9, [r3, #8]
flds        s10, [r2, #12]
flds        s11, [r3, #12]
flds        s12, [r2, #16]
flds        s13, [r3, #16]
flds        s14, [r2, #20]
flds        s15, [r3, #20]
vfma.f32    s0, s6, s7
vfma.f32    s0, s8, s9
vfma.f32    s0, s10, s11
vfma.f32    s0, s12, s13
vfma.f32    s0, s14, s15

ARM NEON-VFPv4 without Eigen

ldr     r3, [r0]
vmov.f32    q8, #0.0  @ v4sf
ldr     r2, [r1]
vld1.32     {q9}, [r3]!
vld1.32     {q10}, [r2]!
vmul.f32    q10, q10, q9
vst1.64     {d20-d21}, [sp:64]
vld1.32     {q9}, [r3]
vld1.32     {q11}, [r2]
vmov     q12, q10  @ v4sf
vfma.f32    q12, q11, q9
vadd.f32    d18, d24, d25
vpadd.f32   d16, d18, d18
vmov.32     r3, d16[0]

It took quite some effort to get the non-Eigen ARM code to be better than the Eigen code. The "naive" version with a simple dot-product for-loop (shown below) produced assembly similar to what Eigen produced. The a and b variables have been __restrict__-qualified and are pointers to aligned memory.

for (int i = 0; i < 8; i++)
{
    out += a[i] * b[i];
}

The results were not what I expected. I decided to split the multiply and the accumulate into two separate loops, and I got the NEON version shown above!

float prods[8];

for (int i = 0; i < 8; i++)
{
    prods[i] = a[i] * b[i];
}
for (int i = 0; i < 8; i++)
{
    out += prods[i];
}

I should also add that without -funsafe-math-optimizations, the auto-vectorization doesn't work. I'm going to keep working on it to see if I can shed a few more instructions, but so far so good!

Tuesday, 30 September 2014

HPL Result Comparisons Between Tegra K1 and Other Boards

Hardware Used

Four platforms were used to test various Cortex A Series CPUs. They are described in the table below:


HPL Results

The HPL results shown below are all given in GFLOPS. I included results at 1 GHz for all boards and then at each board's maximum frequency. The available clock frequencies were as close as possible to 1 GHz, and the differences were almost negligible. Immediately we notice the Jetson Tegra K1 scores approximately 4 GFLOPS more than the Odroid XU+E. This is expected, as it is approximately the same ratio as the clock-frequency ratio of 2300 MHz to 1600 MHz.
HPL Results for four ARM boards

HPL Efficiency

Similarly to what was done in previous posts, I took power measurements of the boards at each frequency and recorded the HPL performance. This gives a nice profile of HPL performance per Watt as a function of clock frequency. The A7 performs fairly poorly, but it is a dual core. There are not many available CPU frequencies on the Wandboard (A9), so we are stuck with just 3 data points, but even so we can pretty much see the pattern. The A15-r3p2 (Odroid, green) clearly shows the transition between the power-saver cores (quad A7) and the higher-powered A15s, which occurs at approximately 600 MHz. The Tegra K1 has much better power efficiency (over 2 GFLOPS/Watt at low frequencies). This is impressive but impractical, since one would never run these devices at <300 MHz for processing data. What is impressive is that even at 2 GHz the efficiency is still over 1 GFLOPS/Watt.

HPL Squared per Watt

As mentioned in my previous posts, the efficiency alone is not that useful. A more interesting feature to look for is the operating frequency at which these chips maximise both performance and power efficiency simultaneously. This really does give us a nice profile of the boards. It also clearly shows the improvements of the Tegra K1 over the Odroid. The main reason is still a little unclear, but what we do know is this: the Tegra K1 carries a later revision of the Cortex-A15. What was changed between the r3p2 and r3p3 revisions is not clear from the ARM website (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0438g/ch01s08s10.html), as all they mention are changed register values and "Various engineering errata fixes". I think a more significant reason is the process used to manufacture the chips. The Odroid was made with the 28 nm HKMG process and the Tegra K1 with the 28HPM process. According to TSMC, 28HPM provides more performance while maintaining the same power leakage as the 28LP process.


Thursday, 25 September 2014

Prometeo, the next-generation test bench for the Tile Calorimeter upgrade in ATLAS

During 9-12 September, the Tile Calorimeter upgrade expert week, the latest version of the Prometeo GUI was presented. After several months of design and consultation with Tile Calorimeter users, the design has been finalised and is working well.

Prometeo is the test bench for the Tile Calorimeter upgrade in 2022. It is a portable enclosure able to certify the Tile Calorimeter front-end electronics during maintenance. Prometeo shares the same design as the super read-out driver (sROD) and operates at the LHC bunch-crossing frequency.

The Prometeo GUI is an interface for experts and users to diagnose the front-end electronics. The interface is web-based and calls Python scripts in the back-end to control the main board in Prometeo. The whole program runs on the CERN web service or on a mini Linux machine inside Prometeo.

The motivations for using a web interface are:
1) No compilation required
2) Compatibility with different platforms: tablet, smartphone or regular PC
3) Robustness over more than 10 years
4) Compatibility with the CERN ROOT software and ease of interface design

A web interface is much easier to maintain than a regular GUI toolkit such as Qt or Java Swing. The approach is common in router software and is used successfully in HEP tools such as MadGraph. The Prometeo GUI combines the web front-end, Python scripts and IPbus hardware control, and it is a good model for lightweight hardware-control software.




Tuesday, 23 September 2014

Peek at the New Electronics Lab

Sneak Peek at the New Lab

We have been in the process of building a new electronics lab at Wits. Yesterday, Oscar Kureba, Matthew Spoor and I moved all our equipment in from the old lab. There is still a lot of work to be done, but it's part of the fun!




Friday, 12 September 2014

HDAYS 2014 in Santander came to a close

HDAYS 2014 in Santander came to a close today. It's been a blast, as always: great people, a wonderful location and an excellent overview of the riches of Higgs physics at the LHC. Run II promises to be a very exciting period for Higgs physics that will set the stage for future endeavours. HDAYS 2015 is set for September 14th-18th. Below is the group photo.



Characterising the TegraK1 Cortex A15

A Benchmark Characterisation of the Cortex-A15-r3p3

I have run HPL and Coremark from the lowest frequency (204 MHz) to the highest (2.3 GHz). At first I thought performance per Watt would be interesting, but as expected, the lower the frequency the lower the power consumption, so it only shows that efficiency is best at 204 MHz. A much more interesting value is Performance/Watt × Performance. This essentially shows us at which frequency the CPU maximises both performance and efficiency simultaneously.

I have done this in the past with the Cubieboard2, Wandboard and Odroid (Cortex-A7-r0p4, Cortex-A9-r2p2, Cortex-A15-r3p2), but only with HPL. I was asked if I had tried it with Coremark. That was simple enough, but I wanted to see if I got the same profile shape as with HPL. The question arose: how do I compare HPL and Coremark? Obviously a direct comparison is not possible, as they are fundamentally different benchmarks, but we can compare the shapes of the graphs. This can be done by normalising the results so that the area under each graph is equal to 1. The units on the y axis are then expressed in inverse frequency (MHz⁻¹). What we get is shown below.


This is really interesting. It shows that the Performance/Watt × Performance approach does indeed describe a characteristic of the CPU. It is also nice to see that it is similar for both benchmarks. In both cases the optimum frequency for maximising both performance and efficiency is 1.73 GHz. We are in the process of putting a full set of results together into a paper. I will add a link as soon as it's finished.

Monday, 8 September 2014

Hands on the New Nvidia TegraK1

Setup to Measure Power Consumption of the TegraK1

I have been benchmarking the Nvidia Tegra K1 on our new Jetson development board for the past few days. I am still busy putting together all the results (mostly of the Cortex-A15 r3p3 and not the GPU, just yet) and will post them soon. In the meantime, here are some pics of the setup for fun.


Using a small PCB with a 0.01 Ohm resistor (designed and built by Mitch) to measure the power consumption of the entire board.
If you look closely you can see the results in Excel :P (Actually you can't, it's showing the Odroid sheet in my Excel spreadsheet lol)

Sneak Peek

A quick sneak peek at the power measurements. I used HPL to stress-test the boards at different frequencies and measured the power consumption. The boards are:
  1. Cortex-A7 = Cubieboard2
  2. Cortex-A9 = Wandboard
  3. Cortex-A15-r3p2 = Odroid XU+E
  4. Cortex-A15-r3p3 = Jetson TK1




Friday, 5 September 2014

ACAT 2014 Workshop

Today was the conclusion of the 16th international workshop for Advanced Computing and Analysis Techniques (ACAT) in physics research. I am extremely glad to have attended this excellent workshop. It was very well organised and sported a wide variety of topics from new computing hardware and techniques to algorithms and even some physics theory, making a well rounded program. The networking and friend-making opportunities were also great.

My talk was an overview of the research the group at Wits has performed so far, as well as details on my progress and plans for the PCI-Express SoC interconnect. There was also a poster from the University of Cape Town, but unfortunately Josh (the author) was unable to attend the conference. I presented the poster in his stead to much interest from the community.

There were two other talks and a plenary by the Barcelona Supercomputing Center on ARM Systems on Chip. Several other talks also made passing mention of ARM's potential suitability for scientific computing. All of these presentations were met with positive comments and questions from the audience.


I felt honoured several times as the organisers and ACAT founders repeatedly mentioned that they had a South African participant (the first, apparently)! Our project also received several mentions in the summaries, enhanced by the South African label.


I look forward to attending the workshop again in 18 months' time, and I hope more South Africans will be fortunate enough to attend next time. Perhaps, with the Square Kilometre Array project surging ahead, we can attract the ACAT workshop to South Africa in a few years?