A Home-Grown Aussie Supercomputer

DownUnder GeoSolutions' supercomputer in Perth is up there with some of the fastest in the world, and it was all done in Australia by Australian engineers and physicists. This story isn't just about a supercomputer; it's also about the hunt for oil and gas deposits underground using seismic surveys.

By Geoff Graham
It might not always be apparent, but the power of computers, and supercomputers in particular, is growing at a staggering pace.
Three years ago, in the July 2015
issue, we reported on the Pawsey
Supercomputing Centre in Western
Australia that housed Magnus, a supercomputer capable of 1.6 petaflops
(1.6 million billion floating point operations per second) – see siliconchip.com.au/Article/8704.
But it has already been overshadowed by a home-grown computer built
by DownUnder GeoSolutions (DUG),
also in Perth, Western Australia, which
has a theoretical speed of 22 petaflops.
That's 22,000,000,000,000,000 calculations per second!
Since the two computers are optimised for different roles, it's difficult
to directly compare them. But by any
measure, the DUG supercomputer is
very fast. And it was built in-house
at a fraction of the cost of the Pawsey
facility.
It's hard to get your head around
how much computing power a petaflop represents.
Think of it this way: the DUG supercomputer does its calculations about
a million times faster than your desktop computer could. So a calculation
that would take the supercomputer
one minute would take two years on
your computer.
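Here is a rough back-of-the-envelope check of that claim in code form. The 20 gigaflop figure for a desktop machine is purely an assumption for illustration, not a measured value.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative figures only */
    const double super_flops   = 22e15;   /* DUG machine: 22 petaflops (theoretical) */
    const double desktop_flops = 20e9;    /* a typical desktop: ~20 gigaflops (assumed) */

    /* Work done by the supercomputer in one minute */
    const double ops = super_flops * 60.0;

    /* Time for the desktop to do the same work */
    const double seconds = ops / desktop_flops;
    const double years   = seconds / (365.25 * 24 * 3600);

    printf("Speed ratio : about %.0fx\n", super_flops / desktop_flops);
    printf("Desktop time: %.1f years\n", years);   /* roughly two years */
    return 0;
}
```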
To build a supercomputer of this
power, you need to be innovative.
DUG are using standard hardware
with Intel's top-of-the-line processor
designed for cluster computing, the
Intel Xeon Phi.
What's innovative is that these are
submerged in huge tanks of dielectric fluid which draw the heat away
while providing near-perfect electrical insulation.
If you have a limited budget, you
also need to be pragmatic, so the Intel
chips are mounted in standard server
racks (immersed in the fluid) and a
standard 10Gb/s network is used to
interconnect them. This is all housed
on the ground floor of an ordinary office building in West Perth.
DownUnder GeoSolutions specialise in analysing geophysical seismic
data and, using their enormous computing power, they can generate accurate three-dimensional maps of the
rock strata under the surface.
These allow geoscientists to precisely locate possible oil and gas deposits, potentially saving hundreds
of millions of dollars in failed drilling attempts.
Seismic surveys
The technology behind seismic surveys is just as interesting as the supercomputer used to process the data. In simple terms, sound waves are created in the rock and the reflections (or echoes) from the layers under the surface are recorded. This can be done on the ocean or on land, and the work that DUG does is evenly split between the two.

Each of the DUG supercomputer facility's fluid-filled tanks holds up to 80 rack-mounted high-performance servers. At the left end of each tank, you can see the heat exchangers which transfer heat from the dielectric fluid to circulating water, which dumps the heat into the atmosphere via radiators cooled by evaporating water. Credit: DownUnder GeoSolutions
A marine survey involves an oceangoing survey vessel towing multiple
lines of hydrophones behind it. These
are called streamers and there could be
up to ten streamers, each up to 12km
long, with as many as 10,000 hydrophones being towed.
Every ten seconds, a sequence of air guns on the rear of the boat fires, creating a shaped sound wave through
the water. When this wave hits the
sea bottom, part of it travels through
to the various rock layers underneath
and on hitting them, is reflected back
to the hydrophones.
Considering the huge number of
multiple reflections from the ocean
bottom and rock layers, and that there
can be up to 10,000 hydrophones, and
that this repeats every ten seconds, you
get a sense of the mass of data that is
recovered.
A full survey can take months of
continuous seismic shots so the DUG
supercomputer must process hundreds of terabytes of data and condense
it into something meaningful.
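The article does not give sampling details, so the estimate below uses assumed figures (2kHz sampling, four bytes per sample, a two-month survey) purely to show that "hundreds of terabytes" is a plausible total.

```c
#include <stdio.h>

int main(void)
{
    /* All figures below are assumptions for illustration; the article only
       states "up to 10,000 hydrophones" and months of continuous shooting. */
    const double hydrophones      = 10000.0;
    const double sample_rate      = 2000.0;   /* samples per second (assumed) */
    const double bytes_per_sample = 4.0;      /* 32-bit samples (assumed)     */
    const double survey_days      = 60.0;     /* about two months (assumed)   */

    const double bytes_per_sec = hydrophones * sample_rate * bytes_per_sample;
    const double total_bytes   = bytes_per_sec * survey_days * 86400.0;

    printf("Raw data rate: %.0f MB/s\n", bytes_per_sec / 1e6);  /* ~80 MB/s */
    printf("Survey total : %.0f TB\n", total_bytes / 1e12);     /* hundreds of terabytes */
    return 0;
}
```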
This is why they needed to build
one of the fastest supercomputers in
the world. Even with their awesome
computing power applied to the task,
processing the data from a single survey can take months.
A land survey typically results in
a smaller data set but it can require
more intense number crunching. In
this case, microphones are planted in
the soil and a truck will thump (or vibrate) a huge iron plate placed on the
ground. The ground reflections are
recorded and the truck moves a short
distance to thump again.
Land surveys generally cover a small
area but the density of data recorded
can be very large, so these surveys also take a lot of supercomputing time to process.
Processing the data
Because of the amount of data involved in a survey (hundreds of terabytes up to a few petabytes), it is not
feasible to transfer the data over the
internet or communications lines.
Instead, it is recorded onto many
tape cartridges of up to 10TB each and
couriered to the processing centre. You
could call it an alternative high-bandwidth network (often referred to as a
"sneakernet"!).
The first task is to eliminate noise in
the data created by ocean waves, wind,
surface conditions etc; specialised
software routines are used for this.
Then the multiple reflections from
the surface and other layers need to be
merged and more specialised routines
are employed for this.
The data analysis and reduction
then commences, using many mathematical techniques such as Kirchhoff
migration, reverse time migration and
full waveform inversion.
As part of the processing, DUG's
own specialist geophysicists will
calibrate the processing parameters
to achieve the best result, which can
highlight and locate the various rock
strata to within one metre.
The ultimate output is a high-resolution 3D image and velocity model of
the various underground layers which
the customer's geoscientists can use to
locate the optimum drilling locations
(see below). At a cost of up to $100
million per drill hole, the savings of
having an accurate picture of the underground geology can be huge.
Without accurately processed and
imaged seismic data, an oil and gas exploration company could waste a lot
of money on failed drilling attempts.
As with all supercomputers these
days, the DUG supercomputer comprises thousands of individual processors, each of which is given a small
segment of the overall job to work on.
A supervisor program running on a
separate computer allocates these subjobs and tracks when each is completed. It then assembles all these individual results into the complete picture.
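The sketch below shows that supervisor pattern in miniature: a job is split into sub-jobs, each one is "dispatched" and its completion tracked, and the partial results are assembled at the end. It is a single-process toy, not DUG's software; the array sum, chunk size and job structure are invented for illustration.

```c
#include <stdio.h>

#define N_SAMPLES 1000000
#define N_SUBJOBS 8          /* pretend each chunk goes to a different processor */

/* One "sub-job": process a slice of the data and return a partial result.
   In a real cluster this would run on a remote node. */
static double run_subjob(const double *data, int start, int count)
{
    double partial = 0.0;
    for (int i = start; i < start + count; i++)
        partial += data[i];
    return partial;
}

int main(void)
{
    static double data[N_SAMPLES];
    for (int i = 0; i < N_SAMPLES; i++)
        data[i] = 1.0;                      /* stand-in for seismic samples */

    double results[N_SUBJOBS];
    int done[N_SUBJOBS] = { 0 };
    const int chunk = N_SAMPLES / N_SUBJOBS;

    /* The "supervisor": hand out each sub-job and track its completion */
    for (int j = 0; j < N_SUBJOBS; j++) {
        results[j] = run_subjob(data, j * chunk, chunk);
        done[j] = 1;
    }

    /* Assemble the individual results into the complete picture */
    double total = 0.0;
    for (int j = 0; j < N_SUBJOBS; j++)
        if (done[j])
            total += results[j];

    printf("Combined result: %.0f\n", total);
    return 0;
}
```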
Innovative cooling
The basic computing unit in the DUG supercomputer is a "tank". This is a large iron tank, painted bright orange and filled with hundreds of litres of polyalphaolefin (PAO) dielectric fluid.
This is a synthetic base oil stock used in the production of high-performance lubricants. It looks and feels like a clear oil but it is non-toxic, non-flammable, biodegradable, has low viscosity and, most importantly, is an excellent insulator.
Each tank holds up to 80 rack-mounted high-performance servers which are immersed in the fluid. This includes the Ethernet connections, the power supply, 230VAC mains cables etc. The whole lot is completely submerged in the fluid.
The fluid is a far better conductor of heat than air, and removing the heat from thousands of processors is not an easy task. Immersed in each tank is a heat exchanger which transfers heat from the fluid to circulating water, which in turn dumps the heat into the atmosphere via outside radiators, which are cooled by evaporating water.
A more traditional computer installation uses fans in each server unit to transfer the heat to the air and then large aircon units to extract the heat from the air. The fans alone consume a lot of power and the air conditioners are not very efficient, so quite a lot of energy (which equates to money) is wasted in just removing the heat.
When you enter the room housing the DUG supercomputer, this point is driven home by the relative quiet in the room. A traditional data centre is deafening, with thousands of fans pushing the air around, but inside the DUG computer room there is just a subdued hum of ancillary equipment – the many servers doing the real work are strangely silent.

Power efficiency
When you consider the advantages of immersion cooling, you have to wonder why more supercomputers do not use the technique. For a start, with a power bill of millions of dollars a year, cutting that bill by 45% makes a huge difference.
The energy efficiency of data centres is commonly rated by a measure called the Power Usage Effectiveness (PUE), which typically ranges from 1.2 for a very efficient site to 1.4 for a more normal data centre. That means that, on top of the power consumed by the computing hardware itself, an extra 20% to 40% goes on cooling, lights and other ancillary equipment.
The DUG supercomputer centre achieves a PUE of 1.04, which is close to the theoretically perfect score of 1.0.
Another advantage of the fluid bath is that all components of the server are held at an even 33-36°C. Nothing is heat stressed, especially the processors, which can run much faster due to the fluid being so good at transporting the heat away.
The fluid also stops oxidation of all electrical joints (for example, the memory sockets) and prevents dust gathering on components, so they fail less often, resulting in better reliability.
About the only downside of the full immersion cooling technique is the rather messy job of removing a server unit for repair or upgrade. The fluid has a low viscosity, so a small amount goes a long way – but at least it is non-toxic and there are always plenty of paper towels on hand.
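As a quick sanity check on the PUE figures quoted above, here is the arithmetic in code form. The 1MW computing load is an assumed figure purely for illustration.

```c
#include <stdio.h>

int main(void)
{
    const double it_load_kw = 1000.0;           /* 1 MW of computing load (assumed) */
    const double pue[] = { 1.04, 1.2, 1.4 };    /* DUG, efficient site, typical site */

    for (int i = 0; i < 3; i++) {
        double total    = it_load_kw * pue[i];  /* power entering the building */
        double overhead = total - it_load_kw;   /* cooling, lights, ancillaries */
        printf("PUE %.2f: total %.0f kW, overhead %.0f kW (%.0f%% extra)\n",
               pue[i], total, overhead, (pue[i] - 1.0) * 100.0);
    }
    return 0;
}
```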
A close-up of the servers silently computing in their liquid heaven. They are
immersed in a polyalphaolefin dielectric fluid, a synthetic base oil stock used
in the production of high-performance lubricants that is also an excellent electrical insulator. Credit: DownUnder GeoSolutions
Server units
In the DUG supercomputer, each
processor (an Intel Xeon Phi – see
explanatory panel) is housed in a
standard rack-mounting server unit
manufactured by companies such as
SuperMicro, Gigabyte and Intel. DUG
removes the fans and the thermal paste
on the central processing unit (CPU)
but otherwise, they are standard off-the-shelf units.
Then the whole lot is submerged
in the dielectric fluid. It is quite unsettling seeing the mains power cord
dive into the fluid but it is such a
good insulator that everything works
perfectly.
As you peer into the tank, you can
see down in the depths various LEDs
on the motherboards still blinking on
and off as the CPUs silently compute
in their liquid heaven.
The processor currently used by
DUG is the Intel Xeon Phi 7250 and
they use so many of this series of chips
that DUG has become Intel's largest
commercial customer for them.
The Phi processor is designed for
use in supercomputers, servers and
workstations, and with a retail price
starting at about 2,000 USD each, it
isn't cheap.
The Xeon Phi's most important
characteristic is that it has the hardware for doing operations on arrays of
floating point numbers (add, multiply
etc) – each core can do up to 64 floating point operations per clock cycle.
Most of the work in analysing the
survey data uses just these functions,
so the fact that they are implemented
in silicon (versus software) is a significant speed advantage.
The Xeon Phi grew out of an earlier
design by Intel for a GPU (Graphics
Processing Unit) and it shares many
of these characteristics. GPUs from
companies such as Nvidia are popular
in many supercomputing applications
because they are effective at operating
on arrays of numbers.
The difference with the Xeon Phi is that these operations can be done efficiently in double-precision floating point (most GPUs can do floating point operations but generally only perform well on "single precision" values) and the chip
can also run standard software such as
Linux, so a separate "standard" processor is not needed to control it.
Each chip contains up to 72 processing cores, running at up to 1.6GHz
with super high-speed memory. With
the hardware floating point and array
processing power, it is very efficient
at processing the sort of data that DUG
works with. With about 8,000 of these
in their supercomputer, they have a lot
of processing power.
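As a rough consistency check, the peak speed of one chip can be estimated from cores × clock × floating point operations per clock. The core count and base clock below are Intel's published figures for the Xeon Phi 7250 (68 cores at 1.4GHz), and the 32 double-precision operations per clock come from its two AVX-512 fused multiply-add units per core; the result is an ideal-case estimate only.

```c
#include <stdio.h>

int main(void)
{
    /* Published figures for the Xeon Phi 7250; double-precision peak */
    const double cores          = 68.0;
    const double clock_hz       = 1.4e9;    /* base clock                         */
    const double dp_flops_clock = 32.0;     /* 2 FMA units x 8 DP lanes x 2 ops   */
    const double chips          = 8000.0;   /* approximate count in the Perth machine */

    const double chip_peak    = cores * clock_hz * dp_flops_clock;
    const double cluster_peak = chip_peak * chips;

    printf("Per chip: %.2f TFLOPS\n", chip_peak / 1e12);     /* ~3 TFLOPS */
    /* ~24 PFLOPS -- the same ballpark as the quoted 22 petaflops */
    printf("Cluster : %.1f PFLOPS\n", cluster_peak / 1e15);
    return 0;
}
```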
The immersion cooling also offers
another advantage: because of its efficient removal of heat, the chips can
run forever at their top turbo speed
without throttling back due to excessive temperatures, as would normally
be the case with air cooling.
Networking
Each server is connected to a 10Gb/s
Ethernet network via standard, off-the-shelf Ethernet switches. Because each
processor can spend a lot of time working on just one job (up to a week), the
demands on the network are not huge
even though there are a lot of connected processors.
Note that other supercomputers use
much faster and more complicated networking arrangements for good reason;
there are certain computing jobs which
involve lots of inter-node communications and they would run slow on
DUG's network; but that is not what
the DUG computer was designed to do.
Throughout the network, the operating system used is a heavily modified version of Linux. The non-critical
sections of the processing software are
written in Java but the time-critical
sections are written in optimised C.
It is worth remembering that all of
this, including the all-important software, was developed and built in-house.
This supercomputer is pragmatically designed using standard components and is not the product of a well-funded government program.
Innovation
DownUnder GeoSolutions must rate
as one of Australia's most innovative
companies.
Started by two friends fifteen years ago in a garage (as most great companies seem to do), they have grown to be the third-largest company in their field, with 350 employees, mostly specialists such as geophysicists, mathematicians, physicists and software developers.
They have offices worldwide and
supercomputer facilities in Houston,
London, Kuala Lumpur and, of course, Perth, which is their largest supercomputer facility and also their headquarters.
The supercomputer outputs a high-resolution 3D image and velocity model of the various underground layers which the
customer's geoscientists can use to locate the optimum drilling locations. At a cost of up to $100 million per drill hole,
having an accurate picture of the underground geology is important. Credit: DownUnder GeoSolutions
A marine survey vessel towing multiple lines of hydrophones. There could be up to 10,000 hydrophones being towed.
Every ten seconds, a sequence of air guns on the rear of the boat fires, creating a shaped sound wave through the water
which reflects off the sea bottom and rock strata underground. Credit: Western-Geophysical-Seismic
The survey vessel creates a sound wave through the water which reflects off the sea bottom and rock strata underground,
back to the hydrophones being towed behind the vessel. A full survey can take months of continuous seismic shots so the
DUG supercomputer must process terabytes of data. Credit: KrisEnergy Ltd
Full waveform inversion (FWI) is a technique to create high-resolution velocity models, in this case on a seismic
waveform. The purpose of this transformation is to use the velocity model (data from the seismic survey) to determine
what the underground structure would look like. The photos above show an initial velocity model (left) and then after
FWI (right), the result being much closer to the actual seismic data. The FWI technique used by DUG possibly makes use
of a finite difference scheme or solutions to the Helmholtz equation among other mathematical techniques to determine
the behaviour of the non-linear system (see www.researchgate.net/publication/268632261_Full_Wave_Inversion).
Image source: www.dug.com/services/full_waveform_inversion_fwi/
Despite the cooling off of Australia's
resources sector, Perth is still one of
the world's premier centres for mining, oil and gas exploration.
As an illustration, it is estimated
that 70% of the world's mining software is developed in Western Australia. Perth also services the many oil
and gas companies exploring the North
West Shelf fields as well as other reserves such as in Bass Strait.
Houston, Texas in the USA is another world centre for oil and gas exploration, and London is a major financial centre as well as servicing
the North Sea.
Often, the data produced by the exploration teams is restricted to one part
of the world due to sovereignty and
security concerns and this is one reason why DUG needs four supercomputing centres.
Another reason is that the company
works closely with its clients when
analysing the data and it is handy to
be close to them.
What's in the future for DUG?
With 56 tanks and about 8,000 processors, the West Perth supercomputer facility rates somewhere in the top
50 or so known supercomputers in
the world.
Shadowy government intelligence
agencies such as the NSA or our own
Australian Signals Directorate likely
have even more powerful supercomputers for jobs like cracking encrypted messages, but the secrecy involved
means that we do not know of them.
However, commercial pressures
continually demand more processing
power. One of the more important processing techniques, called Full Waveform Inversion (FWI), demands enormous computing time.
An important FWI parameter is the frequency, measured in hertz; processing is commonly done at 5Hz to 25Hz but DUG wants to drive towards 125Hz.
The problem is that when you double the frequency, you need 16 times
the computing power to get the full benefit. The higher resolution would result in much more accurate 3D images and models, and these would be eagerly received by DUG's customers and
provide a clear advantage in this competitive industry. To attain this target,
DUG is planning to build a 722 tank
facility in Houston.
Compare this to the 56 tank (approximately 8,000 processor) supercomputer in Perth and you can see the
vastness of the task. When completed,
the Houston supercomputer could be
one of the five largest known supercomputers in the world.
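Taking the article's rule of thumb at face value (doubling the frequency needs 16 times the computing power, ie, a fourth-power law), you can estimate what the jump from 25Hz to 125Hz processing implies. This is only an extrapolation of that rule, not a figure from DUG.

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double f_now    = 25.0;    /* Hz, upper end of common FWI processing */
    const double f_target = 125.0;   /* Hz, DUG's stated goal                  */

    /* Doubling the frequency needs 16x the compute, ie, cost scales as f^4 */
    const double factor = pow(f_target / f_now, 4.0);

    printf("Compute factor for %gHz -> %gHz: about %.0fx\n", f_now, f_target, factor);
    return 0;
}
```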
Other than the multitude of tanks
and processors involved in the proposed Houston facility, there are many
other challenges to be overcome. These
include the network bandwidth required and the practical problem of
managing and tracking the status of
so many processing units.
The reason why Houston was selected for this supercomputer is simple: the cost of electricity. In Perth,
the commercial cost of power is about
15c/kWh while in Houston, it is 4.7c/
kWh. With an annual power bill in the
tens of millions of dollars, that makes
a huge difference.
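The difference the tariff makes is easy to quantify. The 10MW continuous load below is an assumed figure purely for illustration; the article only says the bill runs to tens of millions of dollars a year.

```c
#include <stdio.h>

int main(void)
{
    const double load_kw        = 10000.0;        /* assumed 10 MW continuous load */
    const double hours_per_year = 24.0 * 365.0;
    const double perth_rate     = 0.15;           /* $/kWh */
    const double houston_rate   = 0.047;          /* $/kWh */

    const double kwh = load_kw * hours_per_year;

    printf("Perth  : $%.1f million per year\n", kwh * perth_rate   / 1e6);
    printf("Houston: $%.1f million per year\n", kwh * houston_rate / 1e6);
    return 0;
}
```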
Regardless, the supercomputer will
be designed and managed in Australia
and that is something that all Australians can be proud of.
The world's top supercomputers
1 Summit (122 petaflops)
Summit is an IBM-built supercomputer running at the US Department of Energy’s Oak Ridge
National Laboratory. It has 4608 nodes, each with two IBM Power9 22-core CPUs and
six Nvidia Tesla V100 GPUs.
2 Sunway TaihuLight (93 petaflops)
This is a supercomputer developed by China’s National Research Center of Parallel Computer Engineering & Technology and installed at the National Supercomputing Centre in Wuxi
(Jiangsu province). It uses 40,960 Chinese-made SW26010 256-core CPUs (plus four auxiliary cores) running on a custom operating system.
3 Sierra (71 petaflops)
Sierra is an IBM supercomputer at the Lawrence Livermore National Laboratory in the USA. It has
an architecture similar to that of Summit, with each of its 4320 nodes containing two Power9 CPUs plus four Nvidia Tesla V100 GPUs.
By way of comparison, the DownUnder GeoSolutions supercomputer in West Perth has
a theoretical performance of 22 petaflops. Unlike the above-listed supercomputers, this has
never been tested, simply because running the benchmark would take about seven days
and that would be expensive for DUG in terms of lost production. (source: www.top500.org)
What is the Intel Xeon Phi?
Xeon is the name given to Intel's line of processors intended for
servers. Many Xeon processors are essentially just "beefed up" versions of their desktop processors, with higher clock speeds, more
cores and so on. But the Xeon Phi is a different beast altogether as
it is specifically intended for use in computer clusters.
A typical laptop or desktop processor these days contains 2-8
processing cores (in some cases, more). There are two main uses
for multiple processing cores: either when you are running more
than one application at a time, in which case each application can
run on its own dedicated core, or for applications optimised for multi-core processors, where they can split up their workload across
multiple cores.
But multi-core optimised applications are the exception rather
than the rule, partly due to the significant extra complexity required
to split the work up amongst the cores, and partly due to the fact
that some tasks are easier to split up than others.
Generally, it is very slow, computation-heavy tasks which are
optimised for multiple cores. For example, video compression or
3D rendering.
Both of these tasks can take hours or days to complete and both
are relatively easy to split up into smaller jobs (for example, compressing or rendering one quadrant of the video frame). So optimising them for multi-core processors makes a lot of sense.
But since so many applications are essentially "single-threaded"
and will only occupy one core, laptop and desktop (and phone/tablet)
processors are generally optimised for "straight-line speed", which
requires a high clock rate and the ability for a core to execute as
many instructions simultaneously as possible.
Multi-core optimisation
However, if you need to perform a huge number of computations
then it starts to make sense to design the software to take advantage of more than a few cores. You want to split the job up across
hundreds or thousands of processors. And in that case, the ideal
processor design starts to look quite different.
For a start, the processor clock speed and "straight-line" execution speed are no longer important. If you can design a processor with twice as many cores, where each core runs at 60% of the
speed, then you will have gained 20% additional total performance.
That's assuming that splitting the job up between more cores has a
very low overhead; as usual, there is a point of diminishing returns.
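One common way to quantify that point of diminishing returns is Amdahl's law, which relates the achievable speed-up to the fraction of the job that can actually be parallelised. The 95% figure below is just an example value.

```c
#include <stdio.h>

int main(void)
{
    const double p = 0.95;   /* fraction of the job that can be parallelised (example) */
    const int cores[] = { 1, 2, 8, 64, 1024 };

    /* Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n) */
    for (int i = 0; i < 5; i++) {
        int n = cores[i];
        double speedup = 1.0 / ((1.0 - p) + p / n);
        printf("%5d cores: %.1fx speed-up\n", n, speedup);
    }
    return 0;
}
```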
And lower clock speeds usually provide higher power efficiency,
ie, more work done per watt consumed/dissipated. And that means
less cooling; in many cases, heat dissipation/cooling is actually the
limiting factor in computing density. So improving computational
efficiency can result in a faster cluster.
Also, if reducing clock speed means that you can fit more cores
on a single die, that's also a boon for inter-core communications,
since communication with a core on the same die is much faster
than communication with a core on another die, which in turn is
much faster than communication with a core in a separate chip or in a different chassis.

Close-up photo of the die for the 72-core version of the Xeon Phi used in the DUG supercomputer. Image source: https://seekingalpha.com/article/3738586-intel-selling-stack-knights-landing
And depending on the type of computations being made, it may
be the case that communications are the limiting factor on performance, not raw number-crunching ability.
So for all these reasons and more, if you design a processor from
scratch to be used in a cluster-type environment, its performance
in that role can be dramatically improved.
Enter Xeon Phi
Like a standard Xeon, and most Intel desktop/laptop processors,
the Phi executes x86-64 code. That makes it easy to develop software for. But it has many more cores than a typical processor; the
number varies with the exact version but there are usually 64-72
cores per processor.
This specific line of Xeons, codenamed Knights Landing, utilises
Intel Atom cores (Silvermont) with many major modifications to the
architecture. The Atom line of chips is known primarily for low-power,
low-voltage applications like laptops and systems on a chip (SoC).
These cores also have "hyperthreading" type technology, which
allows around 256 threads of code to be executing simultaneously. However, since many of these share execution units, the overall increase in computing power from this threading feature is modest.
Hand-optimised code potentially performs better with hyperthreading disabled. Clock speeds range from just over 1GHz up to 1.7GHz
in the latest models.
Each chip has a relatively large amount of shared cache memory
(around 34MB) along with smaller caches dedicated to each core.
Their external RAM interfaces are two-tiered, with up to 16GB of
very fast MCDRAM (400+GB/s; normally mounted inside the chip)
and up to 384GB of DDR4 (102.4GB/s; six channels on the motherboard) per chip.
All this results in a speed rating of around 3 teraflops per processor, with a dissipation of around 230W. The power efficiency is
13.04GFLOPS/W (3TFLOPS ÷ 230W).
Compare that to a standard high-end Xeon, for example, an E5-2697A v4 which has 16 cores, runs at up to 3.6GHz and dissipates up
to 145W, giving a performance of around 480-640GFLOPS (depending on how it's measured). That gives a power efficiency figure of
4.4GFLOPS/W (640GFLOPS ÷ 145W) for a retail price of 3000 USD.
When a supercomputer cluster's power consumption is measured in the megawatts (and with the price of electricity these days),
you can see how the much higher power efficiency of the Phi processor – around three times that of the standard Xeon – would be
a great benefit.
Part of the reason for this improvement is the fact that not only
does the Phi have many more lower-clocked cores but they are capable of doing more operations per clock with highly parallel instructions.
AVX-512 Instruction set
Modern standard Xeon processors support the AVX2 SIMD (single-instruction, multiple-data) instruction set, whose 256-bit registers allow up to eight single-precision or four double-precision floating point operations to be performed by each instruction.
The Xeon Phi processors used by DownUnder GeoSolutions support AVX-512 instructions, whose 512-bit registers hold sixteen single-precision or eight double-precision floating point values, all operated on by a single instruction.
Note that in both cases, each core has multiple floating point pipelines and each processor has a large number of cores.
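As an illustration of what those wide instructions look like to a programmer, the sketch below uses Intel's AVX-512 intrinsics to scale-and-add two arrays, 16 single-precision floats at a time. It needs a CPU supporting AVX-512F and a compiler flag such as -mavx512f, and it is only a demonstration of the instruction set, not code from DUG.

```c
#include <immintrin.h>
#include <stdio.h>

/* y[i] = a*x[i] + y[i], processing 16 floats per iteration with AVX-512 */
static void saxpy_avx512(float a, const float *x, float *y, int n)
{
    const __m512 va = _mm512_set1_ps(a);          /* broadcast a into all 16 lanes */
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);       /* load 16 floats from x         */
        __m512 vy = _mm512_loadu_ps(y + i);       /* load 16 floats from y         */
        vy = _mm512_fmadd_ps(va, vx, vy);         /* fused multiply-add: a*x + y   */
        _mm512_storeu_ps(y + i, vy);              /* store 16 results back to y    */
    }
    for (; i < n; i++)                            /* scalar tail for any leftovers */
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    float x[32], y[32];
    for (int i = 0; i < 32; i++) { x[i] = (float)i; y[i] = 1.0f; }

    saxpy_avx512(2.0f, x, y, 32);

    printf("y[31] = %.1f\n", y[31]);   /* 2*31 + 1 = 63.0 */
    return 0;
}
```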
The architecture for the 7XXX series Intel Xeon Phi. All
versions have 38 tiles (2 cores each) to help with yield
recovery. This means defective tiles can be deactivated
and thus the chips can be sold as cheaper variants.
The CPU can execute instructions out-of-order, which
typically provides faster execution than an in-order CPU.
Note that in-order CPUs are more predictable in how they
execute code, so optimisation is easier.
Image source: https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/742945
So the number of calculations that can be processed per clock
is huge, and the number of clock cycles per second is counted in
the billions.
So it's no wonder that these chips can perform a huge number
of calculations per second; a large cluster can contain thousands
of such chips.
Some of the important instructions supported by this CPU include:
PREFETCHWT1
– Prefetch cache line into the L2 cache with intent to write
VEXP2 {PS,PD}
– Approximate 2^x with a maximum relative error of 2^-23. Used on transcendental sequences.
VRSQRT28 {PS,PD}
– Approximate reciprocal square root (1 ÷ √x) with a maximum relative error of 2^-28 before rounding. Used in digital signal processing to normalise a vector (see the sketch below).
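As an example of the last of those, the snippet below uses the corresponding intrinsic, _mm512_rsqrt28_ps, to compute 16 approximate normalisation factors at once. The AVX-512ER instructions involved are only present on Xeon Phi (Knights Landing) chips and need a compiler flag such as -mavx512er, so this will not run on an ordinary desktop CPU; it is a sketch of the idea, not production code.

```c
#include <immintrin.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* 16 squared magnitudes whose normalisation factors (1/sqrt(m)) we want */
    float mags[16], scales[16];
    for (int i = 0; i < 16; i++)
        mags[i] = (float)(i + 1);

    __m512 vm = _mm512_loadu_ps(mags);

    /* One AVX-512ER instruction gives 16 approximate reciprocal square roots,
       accurate to about 2^-28 relative error (Knights Landing only). */
    __m512 vs = _mm512_rsqrt28_ps(vm);

    _mm512_storeu_ps(scales, vs);

    printf("1/sqrt(2): approx %.7f, exact %.7f\n", scales[1], 1.0 / sqrt(2.0));
    return 0;
}
```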
The Xeon Phi is being discontinued by 2019, with the 10nm refresh cancelled and the current product line no longer being sold
or replaced after 2019.
This is likely due to competition from Nvidia, production woes in shrinking the fabrication process and/or Intel's renewed push to produce a discrete graphics processing unit (GPU).
For more information, see the Xeon Phi Wikipedia page: https://en.wikipedia.org/wiki/Xeon_Phi
Intel's developer page on Xeon Phi is at: siliconchip.com.au/link/aal4
SC