Last week I attended the third annual GPGPU conference at the University of Cape Town. Speakers included two international NVIDIA fellows, John Stone of the University of Illinois and Manuel Ujaldón of the University of Malaga, as well as local speaker Bruce Merry, who is working on GPU parallel processing with the Square Kilometre Array. As a relative newcomer to the world of GPGPU, I was privileged to be exposed to so much expert knowledge. John, Manuel and Bruce represent a well of insight into the field of parallel processing, covering both its technical aspects and the politics of GPUs and the microprocessor industry as a whole.
A theme running through the lectures was that excellent processor performance comes from a close relationship between programmer and hardware. Learning to harness the power of potentially millions of independent processing units requires attention to memory management and inter-processor relationships. GPUs have several types of memory, which vary in size and speed of access, and how the many processing units share all that memory must be carefully planned. To get the best possible results, one needs to ensure that as many processing units as possible are active at any given time. However, as John pointed out more than once, one cannot measure performance in GFLOPS alone, as this metric does not necessarily represent 'scientific throughput' in a simple way.
Perhaps the most valuable aspect of the workshop was being privy to John and Bruce's code and the thought processes behind it. John has done some interesting work on taking the visualisation of quantum-chemistry simulations from static image generation to real-time interactive animation.
John and Manuel have been working with GPUs since GPGPU's early days, and so were able to talk us through both the past and the future of the field. Parallel processing does not follow the CPU pattern of ever faster clock rates and more advanced control architecture, but rather seeks a balance between parallel abstractions, memory management, and speed. A crucial constraint in all high-throughput computing environments, such as ATLAS, is I/O throughput. Future architectures, transfer protocols (like NVLink), and processor cluster configurations aim to improve it. These advances depend not only on technology but also on cooperation between big businesses, which do not necessarily benefit from cooperating with competitors.
I am putting this knowledge to work in my research into the role GPGPU will play in the high-throughput TileCal detector. The GPU(s) I use will work in tandem with ARM processors, which, amongst other benefits, have a licensable architecture that allows for more versatile and customised designs. If you're interested in GPGPU techniques and their role in ATLAS, keep an eye on this blog for more parallel processing posts.