Silicon Chip, February 2023
MORE ON
COMPUTER MEMORY
The preceding article provides an overview of modern computer memory
technology, but that technology is complex and would take a great deal of
space to describe fully. We have compiled some interesting facts about the
latest memory technology for those who want to know a bit more.
BY NICHOLAS VINEN
The topics covered in this article include
how data is stored in memory, more
details on the differences between
SRAM and DRAM, how DRAM timings vary, the relatively recent development of high-capacity on-CPU
DRAM and some of the new features
included in the latest DDR5 memory
standard.
Memory encoding schemes
Last month’s first article on Computer Memory described how text
could be stored (eg, as ASCII characters). Early computers had so little memory and such limited I/O that
numbers and text were realistically
the only things they could handle.
But of course, these days, computers
store and display so much more. Here
are some other things that can reside
in RAM.
1. Numbers
Whole numbers (integers) are usually stored in binary, with one byte allowing a range of 0-255 or -128 to +127 to be stored. Two bytes (16 bits) can store an integer of 0-65535 or -32768 to +32767, while four bytes (32 bits) can store 0 to about four billion, or negative two billion to positive two billion.
Financial systems sometimes use BCD (binary-coded decimal), where each byte can store two decimal digits, 0-9 and 0-9. This is somewhat wasteful, as only 100 different values can be stored in a byte rather than 256, but it makes conversion for display easier and ensures correct rounding of dollars and cents etc.
Fixed-point decimal numbers are sometimes used where speed is more critical than precision or range. These are basically integers (whole numbers) with a fixed scaling factor, eg, 1/1000, in which case the integer 1234 represents the decimal 1.234.
For decimal numbers, floating point is the most common storage method. It is similar to numbers in scientific notation, such as 6.02 × 10²³ or 1.602 × 10⁻¹⁹. This allows the handling of tiny and huge numbers in the same amount of space.
Floating point numbers are usually stored as 32 or 64 bits, with one sign bit (positive or negative), an exponent (the power to which the base is raised; two in the binary formats computers use) and the mantissa (6.02 or 1.602 in the previous examples). For 32-bit floating point numbers (‘single precision’), the exponent is eight bits and the mantissa is 23 bits. For a 64-bit floating point number (‘double precision’), the exponent is 11 bits and the mantissa is 52 bits.

2. Still Images
In the early days of computer graphics, images were typically stored as a grid of numbers. The most basic displays are monochrome and can only turn pixels on or off, so each pixel is allocated a bit, usually with 0=off and 1=on. For greyscale images, each pixel is assigned a number, possibly a byte. In that case, 0=black and 255=white, with 254 shades of grey in between.
Colour images usually require between 16 bits (two bytes) and 32 bits (four bytes) per pixel. Those bits are typically split up into three numbers, one for red intensity, one for green and one for blue. Those three colours are mixed in varying proportions to create a range of colours.
[Image caption: A bitmap (“raster”) image next to a vector version of the same image, both shown at 300% scale. Vector images scale better than bitmaps. This is because bitmap images are created by filling individual pixels with a single colour, while vector images are composed of mathematical paths. JPG is an example of a bitmap image format, while SVG is a common vector format.]
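As an aside, the integer, BCD, fixed-point and floating-point encodings described above are easy to poke at with a few lines of Python; its struct module exposes the raw bytes behind integers and IEEE-754 floats (this is just an illustrative sketch, not something from the article itself):

```python
import struct

# One unsigned byte holds 0-255; one signed byte holds -128 to +127.
assert struct.pack("<B", 255) == b"\xff"
assert struct.unpack("<b", b"\x80")[0] == -128

# BCD: each byte stores two decimal digits (only 100 of 256 values used).
def to_bcd(n):
    # e.g. 42 -> 0x42 (high nibble 4, low nibble 2)
    return ((n // 10) << 4) | (n % 10)

assert to_bcd(42) == 0x42

# Fixed point with a 1/1000 scaling factor: integer 1234 means 1.234.
assert 1234 / 1000 == 1.234

# A 32-bit float is 1 sign bit, 8 exponent bits and a 23-bit mantissa.
bits = struct.unpack("<I", struct.pack("<f", 6.02e23))[0]
sign, exponent, mantissa = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
assert sign == 0 and 0 < exponent < 255  # a normal positive number
```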
Images intended for printing might
use four values: CMYK (cyan, magenta,
yellow & black) rather than RGB (red,
green & blue). High dynamic range
(HDR) images might use even more
bits, up to 16 per attribute or 48-64 bits
per pixel. Usually (but not always), all
the colour information is packed into
an integer multiple of the byte size to
make reading/writing pixels in the
memory buffer easier.
For 16-bit RGB colour images, such
as those used on small TFTs, the 16 bits
are usually allocated 5-6-5, with six for
green and five for red and blue. That’s
because the human eye can distinguish
more shades of green than red or blue.
However, the limited number of 16-bit
colours often leads to ‘banding’ in gradients such as a blue sky, so 24-bit
colour (8-8-8 or better) is preferred.
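The 5-6-5 packing just described can be sketched in a few lines of Python (hypothetical helper names; the bit layout follows the common RGB565 convention, with the discarded low bits being the source of the banding mentioned above):

```python
def pack_rgb565(r, g, b):
    """Pack 8-bit R, G, B into a 16-bit 5-6-5 value (green gets 6 bits)."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def unpack_rgb565(p):
    """Expand back to 8 bits per channel; the low bits are lost."""
    return ((p >> 11) << 3, ((p >> 5) & 0x3F) << 2, (p & 0x1F) << 3)

white = pack_rgb565(255, 255, 255)
assert white == 0xFFFF
# Round-tripping shows the quantisation that causes banding:
assert unpack_rgb565(white) == (248, 252, 248)
```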
While bitmaps are conceptually
simple, the trouble is that they are
large. A 4K (3840 × 2160 pixel) image in RGB with HDR (12 bits per attribute) would take 3840 × 2160 × 3 (RGB) × 12 (bits) = 298.6 million bits, or 37.3MB,
if stored as a bitmap.
So images are usually compressed
for storage, eg, as PNG (lossless, preserving the original image perfectly)
or JPEG (lossy) files. Still, in memory,
images are usually kept as bitmaps for
fast access.
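The size arithmetic above is worth checking for yourself; in Python:

```python
# 4K HDR bitmap: 3840 x 2160 pixels, 3 channels (RGB), 12 bits each.
bits = 3840 * 2160 * 3 * 12
assert bits == 298_598_400                 # about 298.6 million bits
assert round(bits / 8 / 1e6, 1) == 37.3    # megabytes, uncompressed
```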
3. Vector Images
Vector images are generally stored as
one or more shapes bounded by lines
or splines. A spline is an elegant way
to define a curve in 2D or 3D space
using just a few numbers. For lines, it’s
only necessary to know the x & y coordinates of each end of the line, while
splines typically have two endpoints
and two control points.
The coordinates can be integers
(whole numbers), floating-point or
fixed-point numbers (decimals). Along
with the bounding information, there
will usually be colour/pattern information, transparency data etc. The
characters used in fonts are defined
this way, as well as many elements in
files such as PDF (portable document
format), PS/EPS (PostScript) etc.
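To make the "few numbers" claim concrete, here is a minimal sketch of evaluating a quadratic Bézier spline, the kind with one control point that TrueType fonts use (cubic splines, as described above, add a second control point; the function name is ours, not from any particular library):

```python
def quad_bezier(p0, c, p1, t):
    """Point on a quadratic Bezier curve at parameter t in [0, 1].
    p0 and p1 are the endpoints; c is the single control point."""
    x = (1 - t)**2 * p0[0] + 2 * (1 - t) * t * c[0] + t**2 * p1[0]
    y = (1 - t)**2 * p0[1] + 2 * (1 - t) * t * c[1] + t**2 * p1[1]
    return (x, y)

# Just six numbers define the whole curve. It passes exactly through
# its endpoints and is pulled towards the control point in between.
assert quad_bezier((0, 0), (5, 10), (10, 0), 0.0) == (0.0, 0.0)
assert quad_bezier((0, 0), (5, 10), (10, 0), 1.0) == (10.0, 0.0)
assert quad_bezier((0, 0), (5, 10), (10, 0), 0.5) == (5.0, 5.0)
```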
4. Audio
In memory, audio is usually stored as PCM (pulse-code modulation). This is simply a series of numbers representing the audio signal voltage sampled at regular intervals.
[Image caption: This image shows the motion vectors (as arrows) from an H.264 encoding of the film Big Buck Bunny (Blender Foundation, Peach Movie Project). Motion vectors are used to describe how one image can be transformed into another. These vectors are used to help compress movie formats; see https://w.wiki/62xT Source: https://trac.ffmpeg.org/wiki/Debug/MacroblocksAndMotionVectors]
The number of points per second is known as
the sampling rate, while the number of
bits allocated to each number is known
as the bit depth. CD-quality audio has
a 44.1kHz sampling rate and 16 bits
per channel (two for stereo).
48kHz is another common sampling
rate. Other rates you might see are one-half, one-quarter, double or four times
either value (44.1kHz or 48kHz). A bit
depth of less than 16 generally means
noisy audio, while lower sampling
rates also lower audio quality. 24-bit
samples are sometimes used for audio
mastering but are not really necessary
for consumer audio, even hifi.
As with still images, audio files
can take up a lot of memory, so they
are usually compressed when stored,
such as in the FLAC format (lossless)
or MP3/AAC (lossy).
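If you want to see PCM in action, generating a tone is only a few lines of Python (our own sketch; 'h' is the standard signed 16-bit array type code):

```python
import array
import math

SAMPLE_RATE = 44100   # CD-quality sampling rate in Hz
MAX_SAMPLE = 32767    # largest value of a signed 16-bit sample

def sine_pcm(freq_hz, seconds, amplitude=0.5):
    """Generate mono 16-bit PCM samples for a sine wave: one number
    per sampling interval, each scaled to the 16-bit range."""
    n = int(SAMPLE_RATE * seconds)
    return array.array("h", (
        int(amplitude * MAX_SAMPLE *
            math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)))

tone = sine_pcm(440, 1.0)      # one second of A440
assert len(tone) == 44100      # 44,100 numbers per second
assert tone.itemsize == 2      # 16 bits (2 bytes) per sample
assert max(tone) <= MAX_SAMPLE
```

At 2 bytes per sample, one second of mono CD-quality audio already occupies about 88kB, which is why compressed formats like FLAC and MP3 exist.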
5. Video
In the most basic sense, a video is
just a series of still images (possibly
accompanied by audio). Therefore,
it can be encoded in the same way as
still images but with more than one,
which is the idea behind the (quite
old) Motion JPEG encoding scheme.
The thing is that most video frames
are very similar to the last frame, so the
amount of memory required is drastically reduced by storing the first frame,
then the difference between each subsequent frame.
Think of a video camera being
panned or zoomed; in the case of panning, a frame will be mostly like the
previous frame but shifted slightly.
The distance and direction can be
encoded in just a few bytes, compared
to kilobytes or megabytes for a whole
new frame image.
In practice, a complete frame (‘I
frame’) is occasionally stored, mainly
to prevent image degradation over
long periods and allow for seeking
in the video. But most frames are
stored only as differences, primarily
in the form of ‘motion vectors’. Such
encoding schemes include the MPEG
series: MPEG, MPEG-2 and these days,
MPEG-4, which encompasses a wide
range of such algorithms.
For example, digital TV and Blu-rays mostly use either MPEG-2 or,
more recently, MPEG-4. The audio
part of the video is encoded much the
same as a regular audio file, usually
in chunks between the video frames.
Because video data can take up
so much space, it is generally stored
compressed in this way in both RAM
and more permanent storage. A frame
buffer is initialised with a bitmap of
the first frame. Then, during playback,
the motion vectors are applied to that
buffer to produce a second buffer containing the next frame image. The process then repeats, alternating between
buffers (sometimes more than two).
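A toy version of that idea can be sketched in Python. Here the whole frame is treated as one block moved by a single motion vector; real codecs divide each frame into many small macroblocks, each with its own vector plus a residual correction, but the principle is the same:

```python
def apply_motion(frame, dx, dy, fill=0):
    """Produce the next frame buffer by shifting the previous one by a
    motion vector (dx, dy); uncovered pixels get a fill value."""
    h, w = len(frame), len(frame[0])
    return [[frame[y - dy][x - dx]
             if 0 <= y - dy < h and 0 <= x - dx < w else fill
             for x in range(w)]
            for y in range(h)]

i_frame = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]

# Panning one pixel right: each pixel comes from its left-hand neighbour.
# Encoding this as (dx=1, dy=0) takes a couple of bytes, versus
# re-sending every pixel of the frame.
next_frame = apply_motion(i_frame, dx=1, dy=0)
assert next_frame == [[0, 1, 2], [0, 4, 5], [0, 7, 8]]
```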
6. 3D Models
3D models are similar to the vector
images described above, only with a
third dimension. A three-dimensional
‘mesh’ of points, lines and/or splines
describes the shape of an object to be
shown on the screen, such as a person, vehicle, building etc. Flat image
‘textures’ are mapped onto the faces of that shape and wrapped around. Lighting effects are applied to make the resulting rendered images look more realistic. Simulated bones can alter the shape of the mesh to produce realistic motion; hair and fur effects can be added on top, and so on, creating a three-dimensional moving image that, these days, can approach photo-realistic levels.
Much computation is required to turn all that data into high-resolution images in real time, which is why modern graphics processing units (GPUs) are usually the computer's most powerful (and power-hungry) part. It's also why GPUs tend to have incredibly fast RAM, sometimes with a total bandwidth exceeding 1000GiB per second!
[Image caption: A 3D polygon mesh of a dolphin. Source: https://w.wiki/62xp]

7. Maps
Maps used for purposes such as navigation are effectively also vector data. Streets and intersections are joined and labelled, and ‘metadata’ is added, such as how many lanes are on a given road, which ones can turn, whether a street is one-way etc. They are stored in memory similarly to mathematical graphs, allowing the shortest or fastest route to be computed and directions to be generated.

SRAM vs DRAM
SRAM memories are simple to use.
To read a byte/word from an SRAM, the
address data is first applied to the chip.
Cascaded logic within the SRAM chip
activates certain lines within, depending on this address, so only the memory cells at that address are enabled.
When the chip’s read-enable line is
activated, the data within those cells
are fed to the data outputs. After a
specific time (usually measured in
nanoseconds), it has stabilised and is
ready to be accessed by the processor.
Writing to an SRAM memory is similar. The address lines are driven to
select the address to be written, and
at the same time, the data to be written is applied to the data input lines
(shared with the data output). When
the write-enable line is activated, the
selected cells within the SRAM will
change their state to match the states of
the data inputs. Again, the cycle time
is usually measured in nanoseconds.
The processor can read and write
addresses in any patterns it needs to,
and the timings do not change. Reads
and writes can proceed at the maximum frequency the chip supports (eg,
100MHz for a 10ns SRAM).
Using a DRAM chip is far more complicated. Rather than having just a few
timings to consider (like the SRAM’s
address and data setup times), a DRAM
has dozens of different timings. That’s
because, to achieve a high density, the
bits in the DRAM chip are arranged in
rows and columns, and only one row
in a bank can be active at a time.
It takes some time to change active
rows. To switch rows, first, the old
row must be deactivated with a PRECHARGE command (and corresponding tRP delay). Then a new row must
be activated with the ACTIVE command, incurring a further delay of
tRCD. Then a column can be read or written after a further delay of tCL.
The tRP, tRCD and tCL delays are usually similar numbers of clock
cycles (eg, around 14 cycles for DDR4).
There is also typically a longer delay
between activating a row and being
able to deselect it. So constantly
switching between rows to read values scattered throughout the memory
is much slower than sequential or random reads within the same row.
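A back-of-the-envelope cost model makes the penalty obvious. Using the roughly-equal tRP, tRCD and tCL figures quoted below (around 14 memory-clock cycles each for typical DDR4; the numbers are illustrative, not from any particular datasheet):

```python
# Approximate DRAM read cost in memory-clock cycles.
tRP, tRCD, tCL = 14, 14, 14   # typical DDR4-class figures

def read_cost(row_already_open):
    """Cycles to read a column, depending on whether the target row
    is already the active row in its bank."""
    if row_already_open:
        return tCL                   # column access only
    return tRP + tRCD + tCL          # PRECHARGE + ACTIVATE + column access

assert read_cost(row_already_open=True) == 14
assert read_cost(row_already_open=False) == 42   # three times slower
```

So a read pattern that hops between rows on every access can be roughly three times slower than one that stays within the open row, which is exactly what caches and bank interleaving try to avoid.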
A few different approaches are used
to overcome this. One is to have a highspeed SRAM cache within the processor that stores the most commonly
accessed memory locations. That way,
cache lines can be rapidly read or
written to the main DRAM memory in
bursts, taking advantage of the ability
to read and write sequential addresses
in the DRAM quickly.
Also, by having multiple banks
within each DIMM, while one bank
cannot operate due to row switching
delays, data going to/from another
bank can pass over the memory interface. So with enough processor cores
constantly reading and writing different banks, the interface is never idle.
If that seems confusing, don’t worry,
it gets a lot more complicated! Modern DRAM has timing parameters that
include the following: CAS, RCD,
RP, RAS, RC, FAW, RRDS, RRDL and
CCDL. That isn’t even a complete list.
These timings are stored in a small
EEPROM on each DIMM for a range
of clock speeds to allow the memory
controller to be appropriately configured at boot time.
Memory timing commands
tCL: CAS latency
tRCD: RAS to CAS delay
tRP: Row precharge time
tRAS: Row active time
For more details, see: https://w.wiki/62vt & siliconchip.au/link/abi2
[Image caption: An example map taken from OpenStreetMap (www.openstreetmap.org/) showing a route (in blue) from Circular Quay to the Sydney Opera House.]
Despite all this data being available,
to achieve the best performance, it’s
still necessary for the memory controller to spend some time ‘training’
the RAM (basically, experimenting
with different timings until it finds
an optimal combination that works).
That is why a newly built computer
can sometimes take quite some time
(tens of seconds) to boot for the first
time, or after a BIOS reset.
One interesting aspect of DRAM performance stems from the availability of multiple banks, combined with the frequent delays in accessing data
within a given bank. Consider a system
with many CPU cores running in parallel, accessing DRAM over a shared
bus. Some cores will be blocked at any
given time, waiting on memory access.
However, at the same time, other
cores may be accessing data stored in
different banks in the DRAM. They
can therefore utilise the otherwise
idle shared bus to transfer that memory. When those transfers complete,
the other banks will likely be ready,
and the bus will be handed over to the
other cores.
Therefore, having many CPU cores
not only increases the total processing
power available but also leads to better utilisation of the memory bus. This
is why sometimes, splitting a task up
among many cores can improve performance even when it is primarily limited by memory performance.
On-package DRAM
[Image caption: A 2KiB SRAM (Static Random Access Memory) chip used in a NES clone. SRAM is significantly faster, but more costly, than DRAM, so it's commonly used in small quantities, such as the L1 and L2 caches of a computer CPU (from a few KiB to a few MiB). Source: https://w.wiki/63EN]

Table 1 – Apple M1 & M2 RAM configurations
Model      RAM capacity            RAM chip        Bus width   Data rate
M1         8GiB or 16GiB           LPDDR4X-4266    128-bit     68.3GB/s
M1 Pro     16GiB or 32GiB          LPDDR5-6400     256-bit     204.8GB/s
M1 Max     32GiB or 64GiB          LPDDR5-6400     512-bit     409.6GB/s
M1 Ultra   64GiB or 128GiB         LPDDR5-6400     1024-bit    819.2GB/s
M2         8GiB, 16GiB or 24GiB    LPDDR5-6400     128-bit     100GB/s

Fast on-chip SRAM caches have been around for a long time, at least as far back as 1989, when Intel launched
the 80486 processor with 8KiB or
16KiB of internal L1 cache. However,
in November 2020, Apple launched
their first range of full computers using
processors that they designed themselves, dubbed the M1.
These processors and their successors, the M2 series, are unique in
today’s market because they do not use
external DRAM for storage. Instead,
they come with a fixed, fairly large
amount of DRAM on a separate silicon
die integrated into the CPU package –
see Table 1. LPDDR is a variant of DDR
(double data rate) DRAM, described
in the preceding article, optimised for
low power consumption.
The main disadvantage of doing
this is obvious: you cannot expand
the RAM on these machines. Also, the
chips are quite expensive to fabricate.
However, the performance benefits are
significant.
While the M1 and M2 cores are individually not especially fast by today's standards, the onboard RAM has so much bandwidth and such low latency (the delay between making a request and the memory read/write being performed) that these chips punch well above their weight in terms of performance, at least in certain tasks.
Unsurprisingly, memory-intensive
tasks benefit the most from this
arrangement, eg, database manipulation. Mathematically-intensive tasks
benefit too, but not to the same extent.
DDR5 advancements
The latest computer memory standard, DDR5, is an evolution of the
now-mature DDR4 standard that has
been around since 2014. Besides manufacturing process improvements
allowing higher speeds at lower voltages, the main enhancements to DDR5
are the addition of local voltage regulation and the splitting of the 64-bit data
channel into two 32-bit channels with
double the maximum burst length.
While DDR4 started at 2133MT/s
(megatransfers per second), a typical DDR4 DIMM these days is rated
at between 3200MT/s and 4000MT/s.
DDR5 starts at 3200MT/s, with a typical DIMM being capable of 4800MT/s and some well over 5000MT/s.
[Image caption: A Micron MT4C-series 128kB DRAM (Dynamic Random Access Memory) chip. DRAM typically uses a single capacitor and transistor to store one bit of data, rather than the multiple transistors per bit of SRAM. DRAM is much cheaper because it needs fewer components per bit, giving a higher density, but in turn uses more power than SRAM. Source: https://w.wiki/63EQ]
For DDR4, switch-mode voltage regulator(s) on the motherboard produce
the ~1.2V needed for the RAM chips
to operate, fed to them via several
edge-connector pins. Instead, DDR5
receives a higher voltage (either 5V
or 12V) that is stepped down to the
required voltage via an onboard regulator that’s usually in the middle of
the DIMM.
This has several advantages, primarily tighter voltage regulation, especially when there are transients. The
baseline operating voltage for DDR5 is
1.1V with a typical maximum of 1.35V,
compared to 1.2-1.6V for DDR4.
As for splitting the data channel in
two, the goal is to reduce latency when
memory is being accessed in a ‘scatter-gather’ manner rather than sequentially. Importantly, DDR5 DRAM
chips have 32 banks compared to the
16 banks of DDR4, meaning that less
bank switching is required, so average
throughput is improved.
The maximum capacity of a DDR5
DIMM is 512GiB, meaning up to 2TiB
of RAM in a four-slot system compared
to 128GiB per DIMM for DDR4.
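The peak bandwidth implied by these transfer rates is simple arithmetic: transfers per second times bytes per transfer. A quick Python check (our own helper, not an official formula from any standard):

```python
def bandwidth_gb_s(mega_transfers, bus_width_bits):
    """Peak bandwidth in GB/s for a given transfer rate (MT/s)
    and data bus width (bits)."""
    return mega_transfers * 1e6 * (bus_width_bits / 8) / 1e9

# A 64-bit DDR4-3200 DIMM:
assert bandwidth_gb_s(3200, 64) == 25.6
# A DDR5-4800 DIMM (two 32-bit channels, 64 bits in total):
assert bandwidth_gb_s(4800, 64) == 38.4
```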
In short, while DDR5 is a significant upgrade over DDR4 (as demonstrated by benchmarks and performance tests), that is due to several
minor improvements rather than any
revolutionary upgrades.
Older DDR generations
As mentioned earlier, DDR4 came
out in 2014. Before that, DDR3 ruled
the roost for almost a decade, since
2007. DDR4 was also an evolutionary
upgrade from DDR3, again mainly due
to process improvements. DDR3 modules typically operated at 1.5V compared to the 1.2V of DDR4, so they
used quite a bit more power.
Compared to the 2133-5000MT/s
of DDR4, DDR3 had a much lower
throughput at 800-2133MT/s (and
rarely up to 3200MT/s). DDR3 DIMMs
also topped out at around 16GiB compared to 128GiB for DDR4. DDR4 also
doubled the number of banks from 8 to 16.
[Diagram caption: A diagram showing how the clock signal differs between SDR (Single Data Rate, one signal per clock cycle), DDR (Double Data Rate, two signals per clock cycle) and QDR (Quad Data Rate, four signals per clock cycle). Source: https://w.wiki/63sx]
Going back further, it's much the same story for DDR2 (released in 2003) compared to DDR3. DDR2 operated at even higher voltages (starting at around 1.8V), so it was even more power-hungry, and slower at 400-1066MT/s. DDR2 also topped out at 8GB per DIMM, although this was very rare compared to the typical 2GB per DIMM.
DDR2 brought a significant upgrade from the original DDR standard (released in 1998). With DDR2, the memory interface bus is clocked at twice the rate of the DRAM chips themselves, so four sets of data can be transferred per memory clock cycle compared to two for DDR1. DDR1 DIMMs also had fewer pins (184 vs 240). DDR2 also optionally doubled the number of banks from four to eight.
[Photo caption: Most DDR2-DDR5 memory (DIMM package) will look similar, with the exception of any fancy heatsinks. DDR1 memory in comparison only has 184 pins versus the 240 pins in DDR2-DDR5 memory. This type of memory is typically used in computers and is a form of synchronous DRAM, which has an external clock signal. The photo above shows a set of four DDR3 modules.]
DDR1 DIMMs operated at just 200-400MT/s and had a maximum capacity of 1GiB per DIMM, limiting most desktop systems to a maximum of
4GiB. They ran at a whopping 2.5-2.6V,
more than double what DDR5 needs!
2GiB DDR1 DIMMs might have been sold specifically for servers, but they likely would not register as the correct amount of memory in a typical desktop machine.
Conclusion
DDR DRAM will be used as the primary memory for computers for some
time, until something better comes
along; nobody knows when or what
that will be. QDR (quad data rate)
DRAM, which performs four transfers per clock cycle, was briefly tried
by Intel in the mid-2000s but never
really took off. GDDR5X video memory
chips from 2016 also had an optional
QDR mode.
DDR performs one transfer on the
negative clock edge and one on the
positive, while QDR does the same but
also performs transfers during the positive and negative plateaus. However,
it seems that the added complexity
isn’t worthwhile, given that this does
nothing to reduce access latency.
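The trade-off is easy to express numerically: each scheme multiplies the transfers per I/O clock without making any individual access arrive sooner. A quick sketch (the 1600MHz figure is an illustrative example, matching DDR4-3200's I/O clock):

```python
# Transfers per clock cycle for each signalling scheme.
TRANSFERS_PER_CLOCK = {"SDR": 1, "DDR": 2, "QDR": 4}

def mega_transfers(clock_mhz, scheme):
    """Transfer rate in MT/s for a given I/O clock frequency."""
    return clock_mhz * TRANSFERS_PER_CLOCK[scheme]

# A 1600MHz I/O clock gives DDR4-3200 its rating; QDR would double
# the throughput again at the same clock, but the time from issuing a
# read to the data arriving (the latency) would be unchanged.
assert mega_transfers(1600, "DDR") == 3200
assert mega_transfers(1600, "QDR") == 6400
```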
These days, the best performance
seems to come from a combination
of highly parallel DRAM, which provides exceptionally high throughputs,
with relatively large and very fast local
SRAM caches such as AMD’s “Infinity
Cache” on its RDNA2 (128MiB cache)
and RDNA3 (96MiB to 384MiB cache)
graphics processors (GPUs).
SC