A Home-Grown Aussie Supercomputer

DownUnder GeoSolutions' supercomputer in Perth is up there with some of the fastest in the world, and it was all done in Australia by Australian engineers and physicists. This story isn't just about a supercomputer; it's also about the hunt for oil and gas deposits underground using seismic surveys.

By Geoff Graham
It might not always be apparent, but the power of computers, and supercomputers in particular, is growing at a staggering pace.
Three years ago, in the July 2015
issue, we reported on the Pawsey
Supercomputing Centre in Western
Australia that housed Magnus, a supercomputer capable of 1.6 petaflops
(1.6 million billion floating point operations per second) – see siliconchip.com.au/Article/8704.
But it has already been overshadowed by a home-grown computer built
by DownUnder GeoSolutions (DUG),
also in Perth, Western Australia, which
has a theoretical speed of 22 petaflops.
That's 22,000,000,000,000,000 calculations per second!
Since the two computers are optimised for different roles, it's difficult
to directly compare them. But by any
measure, the DUG supercomputer is
very fast. And it was built in-house
at a fraction of the cost of the Pawsey
facility.
It's hard to get your head around
how much computing power a petaflop represents.
Think of it this way: the DUG supercomputer does its calculations about
a million times faster than your desktop computer could. So a calculation
that would take the supercomputer
one minute would take two years on
your computer.
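Here is a rough back-of-the-envelope check of that claim in code form. The 20 gigaflop figure for a desktop machine is purely an assumption for illustration, not a measured value.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative figures only */
    const double super_flops   = 22e15;   /* DUG machine: 22 petaflops (theoretical) */
    const double desktop_flops = 20e9;    /* a typical desktop: ~20 gigaflops (assumed) */

    /* Work done by the supercomputer in one minute */
    const double ops = super_flops * 60.0;

    /* Time for the desktop to do the same work */
    const double seconds = ops / desktop_flops;
    const double years   = seconds / (365.25 * 24 * 3600);

    printf("Speed ratio : about %.0fx\n", super_flops / desktop_flops);
    printf("Desktop time: %.1f years\n", years);   /* roughly two years */
    return 0;
}
```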
To build a supercomputer of this
power, you need to be innovative.
DUG are using standard hardware
with Intel's top-of-the-line processor
designed for cluster computing, the
Intel Xeon Phi.
What's innovative is that these are
submerged in huge tanks of dielectric fluid which draw the heat away
while providing near-perfect electrical insulation.
If you have a limited budget, you
also need to be pragmatic, so the Intel
chips are mounted in standard server
racks (immersed in the fluid) and a
standard 10Gb/s network is used to
interconnect them. This is all housed
on the ground floor of an ordinary office building in West Perth.
DownUnder GeoSolutions specialise in analysing geophysical seismic
data and, using their enormous computing power, they can generate accurate three-dimensional maps of the
rock strata under the surface.
These allow geoscientists to precisely locate possible oil and gas deposits, potentially saving hundreds
of millions of dollars in failed drilling attempts.
Seismic surveys
The technology behind seismic surveys is just as interesting as the supercomputer used to process the data. In simple terms, sound waves are created in the rock and the reflections (or echoes) from the layers under the surface are recorded. This can be done on the ocean or on land, and the work that DUG does is evenly split between the two.

Each of the DUG supercomputer facility's fluid-filled tanks holds up to 80 rack-mounted high-performance servers. At the left end of each tank, you can see the heat exchangers which transfer heat from the dielectric fluid to circulating water, which dumps the heat into the atmosphere via radiators cooled by evaporating water. Credit: DownUnder GeoSolutions
A marine survey involves an oceangoing survey vessel towing multiple
lines of hydrophones behind it. These
are called streamers and there could be
up to ten streamers, each up to 12km
long, with as many as 10,000 hydrophones being towed.
Every ten seconds, a sequence of air guns on the rear of the boat fires, creating a shaped sound wave through
the water. When this wave hits the
sea bottom, part of it travels through
to the various rock layers underneath
and on hitting them, is reflected back
to the hydrophones.
Considering the huge number of
multiple reflections from the ocean
bottom and rock layers, and that there
can be up to 10,000 hydrophones, and
that this repeats every ten seconds, you
get a sense of the mass of data that is
recovered.
A full survey can take months of
continuous seismic shots so the DUG
supercomputer must process hundreds of terabytes of data and condense
it into something meaningful.
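The article does not give sampling details, so the estimate below uses assumed figures (2kHz sampling, four bytes per sample, a two-month survey) purely to show that "hundreds of terabytes" is a plausible total.

```c
#include <stdio.h>

int main(void)
{
    /* All figures below are assumptions for illustration; the article only
       states "up to 10,000 hydrophones" and months of continuous shooting. */
    const double hydrophones      = 10000.0;
    const double sample_rate      = 2000.0;   /* samples per second (assumed) */
    const double bytes_per_sample = 4.0;      /* 32-bit samples (assumed)     */
    const double survey_days      = 60.0;     /* about two months (assumed)   */

    const double bytes_per_sec = hydrophones * sample_rate * bytes_per_sample;
    const double total_bytes   = bytes_per_sec * survey_days * 86400.0;

    printf("Raw data rate: %.0f MB/s\n", bytes_per_sec / 1e6);  /* ~80 MB/s */
    printf("Survey total : %.0f TB\n", total_bytes / 1e12);     /* hundreds of terabytes */
    return 0;
}
```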
This is why they needed to build
one of the fastest supercomputers in
the world. Even with their awesome
computing power applied to the task,
processing the data from a single survey can take months.
A land survey typically results in
a smaller data set but it can require
more intense number crunching. In
this case, microphones are planted in
the soil and a truck will thump (or vibrate) a huge iron plate placed on the
ground. The ground reflections are
recorded and the truck moves a short
distance to thump again.
Land surveys generally cover a small
area but the density of data recorded
can be very large, so these surveys also take a lot of supercomputing time to process.
Processing the data
Because of the amount of data involved in a survey (hundreds of terabytes up to a few petabytes), it is not
feasible to transfer the data over the
internet or communications lines.
Instead, it is recorded onto many
tape cartridges of up to 10TB each and
couriered to the processing centre. You
could call it an alternative high-bandwidth network (often referred to as a
"sneakernet"!).
The first task is to eliminate noise in
the data created by ocean waves, wind,
surface conditions etc; specialised
software routines are used for this.
Then the multiple reflections from
the surface and other layers need to be
merged and more specialised routines
are employed for this.
The data analysis and reduction
then commences, using many mathematical techniques such as Kirchhoff
migration, reverse time migration and
full waveform inversion.
As part of the processing, DUG's
own specialist geophysicists will
calibrate the processing parameters
to achieve the best result, which can
highlight and locate the various rock
strata to within one metre.
The ultimate output is a high-resolution 3D image and velocity model of
the various underground layers which
the customer's geoscientists can use to
locate the optimum drilling locations
(see below). At a cost of up to $100
million per drill hole, the savings of
having an accurate picture of the underground geology can be huge.
Without accurately processed and
imaged seismic data, an oil and gas exploration company could waste a lot
of money on failed drilling attempts.
As with all supercomputers these
days, the DUG supercomputer comprises thousands of individual processors, each of which is given a small
segment of the overall job to work on.
A supervisor program running on a
separate computer allocates these subjobs and tracks when each is completed. It then assembles all these individual results into the complete picture.
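The sketch below shows that supervisor pattern in miniature: a job is split into sub-jobs, each one is "dispatched" and its completion tracked, and the partial results are assembled at the end. It is a single-process toy, not DUG's software; the array sum, chunk size and job structure are invented for illustration.

```c
#include <stdio.h>

#define N_SAMPLES 1000000
#define N_SUBJOBS 8          /* pretend each chunk goes to a different processor */

/* One "sub-job": process a slice of the data and return a partial result.
   In a real cluster this would run on a remote node. */
static double run_subjob(const double *data, int start, int count)
{
    double partial = 0.0;
    for (int i = start; i < start + count; i++)
        partial += data[i];
    return partial;
}

int main(void)
{
    static double data[N_SAMPLES];
    for (int i = 0; i < N_SAMPLES; i++)
        data[i] = 1.0;                      /* stand-in for seismic samples */

    double results[N_SUBJOBS];
    int done[N_SUBJOBS] = { 0 };
    const int chunk = N_SAMPLES / N_SUBJOBS;

    /* The "supervisor": hand out each sub-job and track its completion */
    for (int j = 0; j < N_SUBJOBS; j++) {
        results[j] = run_subjob(data, j * chunk, chunk);
        done[j] = 1;
    }

    /* Assemble the individual results into the complete picture */
    double total = 0.0;
    for (int j = 0; j < N_SUBJOBS; j++)
        if (done[j])
            total += results[j];

    printf("Combined result: %.0f\n", total);
    return 0;
}
```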
Innovative cooling
The basic computing unit in the DUG supercomputer is a "tank". This is a large iron tank, painted bright orange and filled with hundreds of litres of polyalphaolefin (PAO) dielectric fluid.
This is a synthetic base oil stock used in the production of high-performance lubricants. It looks and feels like a clear oil but it is non-toxic, non-flammable, biodegradable, has low viscosity and, most importantly, is an excellent insulator.
Each tank holds up to 80 rack-mounted high-performance servers which are immersed in the fluid. This includes the Ethernet connections, the power supply, 230VAC mains cables etc. The whole lot is completely submerged in the fluid.
The fluid is a far better conductor of heat than air, and removing the heat from thousands of processors is not an easy task. Immersed in each tank is a heat exchanger which transfers heat from the fluid to circulating water, which in turn dumps the heat into the atmosphere via outside radiators, which are cooled by evaporating water.
A more traditional computer installation uses fans in each server unit to transfer the heat to the air and then large aircon units to extract the heat from the air. The fans alone consume a lot of power and the air conditioners are not very efficient, so quite a lot of energy (which equates to money) is wasted in just removing the heat.
When you enter the room housing the DUG supercomputer, this point is driven home by the relative quiet in the room. A traditional data centre is deafening, with thousands of fans pushing the air around, but inside the DUG computer room there is just a subdued hum of ancillary equipment – the many servers doing the real work are strangely silent.

Power efficiency
When you consider the advantages of immersion cooling, you have to wonder why more supercomputers do not use the technique. For a start, with a power bill of millions of dollars a year, cutting that bill by 45% makes a huge difference.
The energy efficiency of data centres is commonly rated by a measure called the Power Usage Effectiveness (PUE), which typically ranges from 1.2 for a very efficient site to 1.4 for a more normal data centre. That means that, on top of the power consumed by the computing hardware itself, an extra 20% to 40% goes on cooling, lights and other ancillary equipment.
The DUG supercomputer centre achieves a PUE of 1.04, which is close to the theoretically perfect score of 1.0.
Another advantage of the fluid bath is that all components of the server are held at an even 33-36°C. Nothing is heat stressed, especially the processors, which can run much faster due to the fluid being so good at transporting the heat away.
The fluid also stops oxidation of all electrical joints (for example, the memory sockets) and prevents dust gathering on components, so they fail less often, resulting in better reliability.
About the only downside of the full immersion cooling technique is the rather messy job of removing a server unit for repair or upgrade. The fluid has a low viscosity, so a small amount goes a long way – but at least it is non-toxic and there are always plenty of paper towels on hand.
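As a quick sanity check on the PUE figures quoted above, here is the arithmetic in code form. The 1MW computing load is an assumed figure purely for illustration.

```c
#include <stdio.h>

int main(void)
{
    const double it_load_kw = 1000.0;           /* 1 MW of computing load (assumed) */
    const double pue[] = { 1.04, 1.2, 1.4 };    /* DUG, efficient site, typical site */

    for (int i = 0; i < 3; i++) {
        double total    = it_load_kw * pue[i];  /* power entering the building */
        double overhead = total - it_load_kw;   /* cooling, lights, ancillaries */
        printf("PUE %.2f: total %.0f kW, overhead %.0f kW (%.0f%% extra)\n",
               pue[i], total, overhead, (pue[i] - 1.0) * 100.0);
    }
    return 0;
}
```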
A close-up of the servers silently computing in their liquid heaven. They are
immersed in a polyalphaolefin dielectric fluid, a synthetic base oil stock used
in the production of high-performance lubricants that is also an excellent electrical insulator. Credit: DownUnder GeoSolutions
Server units
In the DUG supercomputer, each
processor (an Intel Xeon Phi – see
explanatory panel) is housed in a
standard rack-mounting server unit
manufactured by companies such as
SuperMicro, Gigabyte and Intel. DUG
removes the fans and the thermal paste
on the central processing unit (CPU)
but otherwise, they are standard off-the-shelf units.
Then the whole lot is submerged
in the dielectric fluid. It is quite unsettling seeing the mains power cord
dive into the fluid but it is such a
good insulator that everything works
perfectly.
As you peer into the tank, you can
see down in the depths various LEDs
on the motherboards still blinking on
and off as the CPUs silently compute
in their liquid heaven.
The processor currently used by
DUG is the Intel Xeon Phi 7250 and
they use so many of this series of chips
that DUG has become Intel's largest
commercial customer for them.
The Phi processor is designed for
use in supercomputers, servers and
workstations, and with a retail price
starting at about 2,000 USD each, it
isn't cheap.
The Xeon Phi's most important
characteristic is that it has the hardware for doing operations on arrays of
floating point numbers (add, multiply
etc) – each core can do up to 64 floating point operations per clock cycle.
Most of the work in analysing the
survey data uses just these functions,
so the fact that they are implemented
in silicon (versus software) is a significant speed advantage.
The Xeon Phi grew out of an earlier
design by Intel for a GPU (Graphics
Processing Unit) and it shares many
of these characteristics. GPUs from
companies such as Nvidia are popular
in many supercomputing applications
because they are effective at operating
on arrays of numbers.
The difference with the Xeon Phi is that these operations can be done efficiently in double-precision floating point (most GPUs can do floating point operations but generally only perform well on "single precision" values) and the chip
can also run standard software such as
Linux, so a separate "standard" processor is not needed to control it.
Each chip contains up to 72 processing cores, running at up to 1.6GHz
with super high-speed memory. With
the hardware floating point and array
processing power, it is very efficient
at processing the sort of data that DUG
works with. With about 8,000 of these
in their supercomputer, they have a lot
of processing power.
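As a rough consistency check, the peak speed of one chip can be estimated from cores × clock × floating point operations per clock. The core count and base clock below are Intel's published figures for the Xeon Phi 7250 (68 cores at 1.4GHz), and the 32 double-precision operations per clock come from its two AVX-512 fused multiply-add units per core; the result is an ideal-case estimate only.

```c
#include <stdio.h>

int main(void)
{
    /* Published figures for the Xeon Phi 7250; double-precision peak */
    const double cores          = 68.0;
    const double clock_hz       = 1.4e9;    /* base clock                         */
    const double dp_flops_clock = 32.0;     /* 2 FMA units x 8 DP lanes x 2 ops   */
    const double chips          = 8000.0;   /* approximate count in the Perth machine */

    const double chip_peak    = cores * clock_hz * dp_flops_clock;
    const double cluster_peak = chip_peak * chips;

    printf("Per chip: %.2f TFLOPS\n", chip_peak / 1e12);     /* ~3 TFLOPS */
    /* ~24 PFLOPS -- the same ballpark as the quoted 22 petaflops */
    printf("Cluster : %.1f PFLOPS\n", cluster_peak / 1e15);
    return 0;
}
```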
The immersion cooling also offers
another advantage: because of its efficient removal of heat, the chips can
run forever at their top turbo speed
without throttling back due to excessive temperatures, as would normally
be the case with air cooling.
Networking
Each server is connected to a 10Gb/s
Ethernet network via standard, off-the-shelf Ethernet switches. Because each
processor can spend a lot of time working on just one job (up to a week), the
demands on the network are not huge
even though there are a lot of connected processors.
Note that other supercomputers use
much faster and more complicated networking arrangements for good reason;
there are certain computing jobs which
involve lots of inter-node communications and they would run slow on
DUG's network; but that is not what
the DUG computer was designed to do.
Throughout the network, the operating system used is a heavily modified version of Linux. The non-critical
sections of the processing software are
written in Java but the time-critical
sections are written in optimised C.
It is worth remembering that all of
this, including the all-important software, was developed and built in-house.
This supercomputer is pragmatically designed using standard components and is not the product of a well-funded government program.
Innovation
DownUnder GeoSolutions must rate
as one of Australia's most innovative
companies.
Started by two friends fifteen years ago in a garage (as most great companies seem to do), they have grown to be the third-largest company in their field, with 350 employees, mostly specialists such as geophysicists, mathematicians, physicists and software developers.
They have offices worldwide and
supercomputer facilities in Houston,
London, Kuala Lumpur and, of course, Perth, which is their largest supercomputer facility and also their headquarters.
The supercomputer outputs a high-resolution 3D image and velocity model of the various underground layers which the
customer's geoscientists can use to locate the optimum drilling locations. At a cost of up to $100 million per drill hole,
having an accurate picture of the underground geology is important. Credit: DownUnder GeoSolutions
A marine survey vessel towing multiple lines of hydrophones. There could be up to 10,000 hydrophones being towed.
Every ten seconds, a sequence of air guns on the rear of the boat fires, creating a shaped sound wave through the water
which reflects off the sea bottom and rock strata underground. Credit: Western-Geophysical-Seismic
The survey vessel creates a sound wave through the water which reflects off the sea bottom and rock strata underground,
back to the hydrophones being towed behind the vessel. A full survey can take months of continuous seismic shots so the
DUG supercomputer must process terabytes of data. Credit: KrisEnergy Ltd
Full waveform inversion (FWI) is a technique to create high-resolution velocity models, in this case on a seismic
waveform. The purpose of this transformation is to use the velocity model (data from the seismic survey) to determine
what the underground structure would look like. The photos above show an initial velocity model (left) and then after
FWI (right), the result being much closer to the actual seismic data. The FWI technique used by DUG possibly makes use
of a finite difference scheme or solutions to the Helmholtz equation among other mathematical techniques to determine
the behaviour of the non-linear system (see www.researchgate.net/publication/268632261_Full_Wave_Inversion).
Image source: www.dug.com/services/full_waveform_inversion_fwi/
Despite the cooling off of Australia's
resources sector, Perth is still one of
the world's premier centres for mining, oil and gas exploration.
As an illustration, it is estimated
that 70% of the world's mining software is developed in Western Australia. Perth also services the many oil
and gas companies exploring the North
West Shelf fields as well as other reserves such as in Bass Strait.
Houston, Texas in the USA is another world centre for oil and gas exploration, and London is a major financial centre as well as servicing
the North Sea.
Often, the data produced by the exploration teams is restricted to one part
of the world due to sovereignty and
security concerns and this is one reason why DUG needs four supercomputing centres.
Another reason is that the company
works closely with its clients when
analysing the data and it is handy to
be close to them.
What's in the future for DUG?
With 56 tanks and about 8,000 processors, the West Perth supercomputer facility rates somewhere in the top
50 or so known supercomputers in
the world.
Shadowy government intelligence
agencies such as the NSA or our own
Australian Signals Directorate likely
have even more powerful supercomputers for jobs like cracking encrypted messages, but the secrecy involved
means that we do not know of them.
However, commercial pressures
continually demand more processing
power. One of the more important processing techniques, called Full Waveform Inversion (FWI), demands enormous computing time.
An important FWI parameter is the frequency, measured in hertz; processing is commonly done at 5Hz to 25Hz but DUG wants to drive towards 125Hz.
The problem is that when you double the frequency, you need 16 times
the computing power to get the full benefit. The higher resolution would result in much more accurate 3D images and models, and these would be eagerly received by DUG's customers and
provide a clear advantage in this competitive industry. To attain this target,
DUG is planning to build a 722 tank
facility in Houston.
Compare this to the 56 tank (approximately 8,000 processor) supercomputer in Perth and you can see the
vastness of the task. When completed,
the Houston supercomputer could be
one of the five largest known supercomputers in the world.
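Taking the article's rule of thumb at face value (doubling the frequency needs 16 times the computing power, ie, a fourth-power law), you can estimate what the jump from 25Hz to 125Hz processing implies. This is only an extrapolation of that rule, not a figure from DUG.

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double f_now    = 25.0;    /* Hz, upper end of common FWI processing */
    const double f_target = 125.0;   /* Hz, DUG's stated goal                  */

    /* Doubling the frequency needs 16x the compute, ie, cost scales as f^4 */
    const double factor = pow(f_target / f_now, 4.0);

    printf("Compute factor for %gHz -> %gHz: about %.0fx\n", f_now, f_target, factor);
    return 0;
}
```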
Other than the multitude of tanks
and processors involved in the proposed Houston facility, there are many
other challenges to be overcome. These
include the network bandwidth required and the practical problem of
managing and tracking the status of
so many processing units.
The reason why Houston was selected for this supercomputer is simple: the cost of electricity. In Perth,
the commercial cost of power is about
15c/kWh while in Houston, it is 4.7c/
kWh. With an annual power bill in the
tens of millions of dollars, that makes
a huge difference.
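The difference the tariff makes is easy to quantify. The 10MW continuous load below is an assumed figure purely for illustration; the article only says the bill runs to tens of millions of dollars a year.

```c
#include <stdio.h>

int main(void)
{
    const double load_kw        = 10000.0;        /* assumed 10 MW continuous load */
    const double hours_per_year = 24.0 * 365.0;
    const double perth_rate     = 0.15;           /* $/kWh */
    const double houston_rate   = 0.047;          /* $/kWh */

    const double kwh = load_kw * hours_per_year;

    printf("Perth  : $%.1f million per year\n", kwh * perth_rate   / 1e6);
    printf("Houston: $%.1f million per year\n", kwh * houston_rate / 1e6);
    return 0;
}
```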
Regardless, the supercomputer will
be designed and managed in Australia
and that is something that all Australians can be proud of.
The world's top supercomputers
1 Summit (122 petaflops)
Summit is an IBM-built supercomputer running at the US Department of Energy’s Oak Ridge
National Laboratory. It has 4608 nodes, each with two IBM Power9 22-core CPUs and
six Nvidia Tesla V100 GPUs.
2 Sunway TaihuLight (93 petaflops)
This is a supercomputer developed by China’s National Research Center of Parallel Computer Engineering & Technology and installed at the National Supercomputing Centre in Wuxi
(Jiangsu province). It uses 40,960 Chinese-made SW26010 256-core CPUs (plus four auxiliary cores) running on a custom operating system.
3 Sierra (71 petaflops)
Sierra is an IBM supercomputer at the Lawrence Livermore National Laboratory in the USA. It has
an architecture similar to that of Summit, with each of its 4320 nodes containing two Power9 CPUs plus four Nvidia Tesla V100 GPUs.
By way of comparison, the DownUnder GeoSolutions supercomputer in West Perth has
a theoretical performance of 22 petaflops. Unlike the above-listed supercomputers, this has
never been tested, simply because running the benchmark would take about seven days
and that would be expensive for DUG in terms of lost production. (source: www.top500.org)
What is the Intel Xeon Phi?
Xeon is the name given to Intel's line of processors intended for
servers. Many Xeon processors are essentially just "beefed up" versions of their desktop processors, with higher clock speeds, more
cores and so on. But the Xeon Phi is a different beast altogether as
it is specifically intended for use in computer clusters.
A typical laptop or desktop processor these days contains 2-8
processing cores (in some cases, more). There are two main uses
for multiple processing cores: either when you are running more
than one application at a time, in which case each application can
run on its own dedicated core, or for applications optimised for multi-core processors, where they can split up their workload across
multiple cores.
But multi-core optimised applications are the exception rather
than the rule, partly due to the significant extra complexity required
to split the work up amongst the cores, and partly due to the fact
that some tasks are easier to split up than others.
Generally, it is very slow, computation-heavy tasks which are
optimised for multiple cores. For example, video compression or
3D rendering.
Both of these tasks can take hours or days to complete and both
are relatively easy to split up into smaller jobs (for example, compressing or rendering one quadrant of the video frame). So optimising them for multi-core processors makes a lot of sense.
But since so many applications are essentially "single-threaded"
and will only occupy one core, laptop and desktop (and phone/tablet)
processors are generally optimised for "straight-line speed", which
requires a high clock rate and the ability for a core to execute as
many instructions simultaneously as possible.
Multi-core optimisation
However, if you need to perform a huge number of computations
then it starts to make sense to design the software to take advantage of more than a few cores. You want to split the job up across
hundreds or thousands of processors. And in that case, the ideal
processor design starts to look quite different.
For a start, the processor clock speed and "straight-line" execution speed are no longer important. If you can design a processor with twice as many cores, where each core runs at 60% of the
speed, then you will have gained 20% additional total performance.
That's assuming that splitting the job up between more cores has a
very low overhead; as usual, there is a point of diminishing returns.
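One common way to quantify that point of diminishing returns is Amdahl's law, which relates the achievable speed-up to the fraction of the job that can actually be parallelised. The 95% figure below is just an example value.

```c
#include <stdio.h>

int main(void)
{
    const double p = 0.95;   /* fraction of the job that can be parallelised (example) */
    const int cores[] = { 1, 2, 8, 64, 1024 };

    /* Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n) */
    for (int i = 0; i < 5; i++) {
        int n = cores[i];
        double speedup = 1.0 / ((1.0 - p) + p / n);
        printf("%5d cores: %.1fx speed-up\n", n, speedup);
    }
    return 0;
}
```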
And lower clock speeds usually provide higher power efficiency,
ie, more work done per watt consumed/dissipated. And that means
less cooling; in many cases, heat dissipation/cooling is actually the
limiting factor in computing density. So improving computational
efficiency can result in a faster cluster.
Also, if reducing clock speed means that you can fit more cores
on a single die, that's also a boon for inter-core communications,
since communication with a core on the same die is much faster
than communication with a core on another die, which in turn is
much faster than communication with a core in a separate chip or in a different chassis.

Close-up photo of the die for the 72-core version of the Xeon Phi used in the DUG supercomputer. Image source: https://seekingalpha.com/article/3738586-intel-selling-stack-knights-landing
And depending on the type of computations being made, it may
be the case that communications are the limiting factor on performance, not raw number-crunching ability.
So for all these reasons and more, if you design a processor from
scratch to be used in a cluster-type environment, its performance
in that role can be dramatically improved.
Enter Xeon Phi
Like a standard Xeon, and most Intel desktop/laptop processors,
the Phi executes x86-64 code. That makes it easy to develop software for. But it has many more cores than a typical processor; the
number varies with the exact version but there are usually 64-72
cores per processor.
This specific line of Xeons, codenamed Knights Landing, utilises
Intel Atom cores (Silvermont) with many major modifications to the
architecture. The Atom line of chips is known primarily for low-power,
low-voltage applications like laptops and systems on a chip (SoC).
These cores also have "hyperthreading" type technology, which
allows around 256 threads of code to be executing simultaneously. However, since many of these share execution units, the overall increase in computing power from this threading feature is modest.
Hand-optimised code potentially performs better with hyperthreading disabled. Clock speeds range from just over 1GHz up to 1.7GHz
in the latest models.
Each chip has a relatively large amount of shared cache memory
(around 34MB) along with smaller caches dedicated to each core.
Their external RAM interfaces are two-tiered, with up to 16GB of
very fast MCDRAM (400+GB/s; normally mounted inside the chip)
and up to 384GB of DDR4 (102.4GB/s; six channels on the motherboard) per chip.
All this results in a speed rating of around 3 teraflops per processor, with a dissipation of around 230W. The power efficiency is
13.04GFLOPS/W (3TFLOPS ÷ 230W).
Compare that to a standard high-end Xeon, for example, an E5-2697A v4 which has 16 cores, runs at up to 3.6GHz and dissipates up
to 145W, giving a performance of around 480-640GFLOPS (depending on how it's measured). That gives a power efficiency figure of
4.4GFLOPS/W (640GFLOPS ÷ 145W) for a retail price of 3000 USD.
When a supercomputer cluster's power consumption is measured in the megawatts (and with the price of electricity these days),
you can see how the much higher power efficiency of the Phi processor – around three times that of the standard Xeon – would be
a great benefit.
Part of the reason for this improvement is the fact that not only
does the Phi have many more lower-clocked cores but they are capable of doing more operations per clock with highly parallel instructions.
AVX-512 Instruction set
Modern standard Xeon processors support the AVX2 SIMD (single-instruction, multiple-data) instruction set, whose 256-bit registers allow up to eight single-precision or four double-precision floating point operations to be performed by each instruction.
The Xeon Phi processors used by DownUnder GeoSolutions support AVX-512 instructions, whose 512-bit registers hold sixteen single-precision or eight double-precision floating point values, all operated on by a single instruction.
Note that in both cases, each core has multiple floating point pipelines and each processor has a large number of cores.
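As an illustration of what those wide instructions look like to a programmer, the sketch below uses Intel's AVX-512 intrinsics to scale-and-add two arrays, 16 single-precision floats at a time. It needs a CPU supporting AVX-512F and a compiler flag such as -mavx512f, and it is only a demonstration of the instruction set, not code from DUG.

```c
#include <immintrin.h>
#include <stdio.h>

/* y[i] = a*x[i] + y[i], processing 16 floats per iteration with AVX-512 */
static void saxpy_avx512(float a, const float *x, float *y, int n)
{
    const __m512 va = _mm512_set1_ps(a);          /* broadcast a into all 16 lanes */
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);       /* load 16 floats from x         */
        __m512 vy = _mm512_loadu_ps(y + i);       /* load 16 floats from y         */
        vy = _mm512_fmadd_ps(va, vx, vy);         /* fused multiply-add: a*x + y   */
        _mm512_storeu_ps(y + i, vy);              /* store 16 results back to y    */
    }
    for (; i < n; i++)                            /* scalar tail for any leftovers */
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    float x[32], y[32];
    for (int i = 0; i < 32; i++) { x[i] = (float)i; y[i] = 1.0f; }

    saxpy_avx512(2.0f, x, y, 32);

    printf("y[31] = %.1f\n", y[31]);   /* 2*31 + 1 = 63.0 */
    return 0;
}
```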
The architecture for the 7XXX series Intel Xeon Phi. All
versions have 38 tiles (2 cores each) to help with yield
recovery. This means defective tiles can be deactivated
and thus the chips can be sold as cheaper variants.
The CPU can execute instructions out-of-order, which
typically provides faster execution than an in-order CPU.
Note that in-order CPUs are more predictable in how they
execute code, so optimisation is easier.
Image source: https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/742945
So the number of calculations that can be processed per clock
is huge, and the number of clock cycles per second is counted in
the billions.
So it's no wonder that these chips can perform a huge number
of calculations per second; a large cluster can contain thousands
of such chips.
Some of the important instructions supported by this CPU include:
PREFETCHWT1
– Prefetch cache line into the L2 cache with intent to write
VEXP2 {PS,PD}
– Approximate 2^x with a maximum relative error of 2^-23. Used on transcendental sequences.
VRSQRT28 {PS,PD}
– Approximate reciprocal square root (1 ÷ √x) with a maximum relative error of 2^-28 before rounding. Used in digital signal processing to normalise a vector (see the sketch below).
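As an example of the last of those, the snippet below uses the corresponding intrinsic, _mm512_rsqrt28_ps, to compute 16 approximate normalisation factors at once. The AVX-512ER instructions involved are only present on Xeon Phi (Knights Landing) chips and need a compiler flag such as -mavx512er, so this will not run on an ordinary desktop CPU; it is a sketch of the idea, not production code.

```c
#include <immintrin.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* 16 squared magnitudes whose normalisation factors (1/sqrt(m)) we want */
    float mags[16], scales[16];
    for (int i = 0; i < 16; i++)
        mags[i] = (float)(i + 1);

    __m512 vm = _mm512_loadu_ps(mags);

    /* One AVX-512ER instruction gives 16 approximate reciprocal square roots,
       accurate to about 2^-28 relative error (Knights Landing only). */
    __m512 vs = _mm512_rsqrt28_ps(vm);

    _mm512_storeu_ps(scales, vs);

    printf("1/sqrt(2): approx %.7f, exact %.7f\n", scales[1], 1.0 / sqrt(2.0));
    return 0;
}
```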
The Xeon Phi is being discontinued by 2019, with the 10nm refresh cancelled and the current product line no longer being sold
or replaced after 2019.
This is likely due to competition from Nvidia, production woes in shrinking the fabrication process and/or Intel's renewed push to produce a discrete graphics processing unit (GPU).
For more information, see the Xeon Phi Wikipedia page: https://en.wikipedia.org/wiki/Xeon_Phi
Intel's developer page on Xeon Phi is at: siliconchip.com.au/link/aal4
SC