White Paper

FPGA Video Processing

How Intel Agilex® FPGA is Enabling Resource and Power Efficient 4K, 8K Video Processing Solutions

Authors
Neil Childs
Engineering Manager
Intel Programmable Solutions Group

Introduction
The new Intel Agilex® FPGA family is designed to operate at higher frequencies than previous families. This enables FPGA developers to minimize resource usage and power for a given logic function. The ability to reach 600 MHz, often without requiring extensive rewriting of existing register transfer level (RTL), is of particular interest to video designers as it enables “4K” video at 60 frames-per-second to be processed as 1 pixel-in-parallel (PIP).

Cyclone® V FPGA          4 PIP at 150 MHz
Intel® Arria® 10 FPGA    2 PIP at 300 MHz
Intel Agilex® FPGA       1 PIP at 600 MHz

The ability to run half the processing pipelines at double the clock frequency,
compared to previous FPGA families, will become more important with the advent
of “8K” video solutions. It means that video IP cores already capable of operating at
4 pixels-in-parallel will not require re-architecting with 8 pixels-in-parallel support
for “8K”.
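
The relationship between fabric speed and pixels-in-parallel can be sketched in a few lines of Python. This is an illustration only: the function name `required_pip` and the restriction to power-of-two pipeline counts are assumptions made here, and the 594 MHz and 2,376 MHz pixel clocks are the figures quoted in the Video Clock Frequencies section of this paper.

```python
# Pixel clocks in MHz for 60 fps video, as quoted in this paper.
PIXEL_CLOCK_MHZ = {"4K": 594.0, "8K": 2376.0}


def required_pip(pixel_clock_mhz: float, fabric_fmax_mhz: float) -> int:
    """Return the smallest power-of-two number of pixels-in-parallel (PIP)
    for which the per-pipeline clock does not exceed the fabric frequency.

    Hypothetical helper for illustration; it assumes pipelines are only
    ever doubled (1, 2, 4, 8, ...).
    """
    pip = 1
    while pixel_clock_mhz / pip > fabric_fmax_mhz:
        pip *= 2
    return pip


if __name__ == "__main__":
    for fmax in (150.0, 300.0, 600.0):
        for name, clock in PIXEL_CLOCK_MHZ.items():
            pip = required_pip(clock, fmax)
            print(f"{name} with {fmax:.0f} MHz fabric: "
                  f"{pip} PIP at {clock / pip:.2f} MHz per pipeline")
```

Under these assumptions a 600 MHz fabric handles “8K” with the same 4 PIP structure that a 150 MHz fabric needs for “4K”, which is the reuse argument made above.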

Table of Contents
Introduction
Video Clock Frequencies
Real-World FPGA Clock Frequencies
IP Core Resource Usage
Power Savings
Wider System Improvements
Full System Case Study
    Resource Usage
    Pin Count
    Device Selection
Conclusions

Video Clock Frequencies
Video resolutions have increased over the decades from SD (720x486), through HD (1920x1080) and UHD “4K” (3840x2160) to “8K” (7680x4320) and beyond. The clock frequency required to handle this increasing bandwidth has likewise increased. The “pixel clock” for SD resolution video at 60 frames per second (fps) was a mere 13.5 MHz, easily accomplished today but challenging at its introduction in the early 1990s. High definition (HD) video resolutions required clock frequencies of 74.25 MHz or 148.5 MHz, which again were challenging but achievable for their era. Today's “4K” resolution requires a pixel clock of 594 MHz, in excess of what FPGAs could realistically reach until very recently, while “8K” needs 2,376 MHz. These very high clock rates forced a different approach from video engineers.

When “4K”, or UHDTV, was first being widely developed around a decade ago, the typical FPGA families used were the Stratix® V FPGA or Intel® Arria® 10 FPGA families, which were not intended to reach 594 MHz. To cope with this limitation, video intellectual property (IP) cores, such as scalers or color space converters, were redesigned to process multiple pixels on each clock cycle. In the majority of cases this meant duplicating the entire video pipeline within the IP core. Moving from 1 pixel-in-parallel to 2 pixels-in-parallel (PIP) for “4K” video could result in a doubling of FPGA resources used. The current early adopters of “8K” video designs often rely on a similar technique of processing 8 pixels-in-parallel, with predictable increases in FPGA utilization.

An FPGA family such as Intel Agilex FPGA, which can truly handle real video designs at 594 MHz, therefore allows designers to halve the size of certain video IP cores when compared with previous families.

It should be noted that for video, unlike some other technology areas, resource utilization does not scale linearly with frequency. The device either reaches 594 MHz, or it does not. You either need 2X the resources, or you do not. An FPGA only capable of 550 MHz may allow extra headroom for routing and timing closure, but the video logic will likely remain clocked at 300 MHz. Such steps in clock frequencies do exist in other technology areas, for example PCIe has such steps at 62.5 MHz, 125 MHz, 250 MHz, and 500 MHz. Each frequency step you can achieve halves the width of the datapath, saving fabric and routing resources inside the FPGA.

It should further be noted that the resource saving is not uniform for all resource types. For some IP cores, such as a 3D LUT, halving the number of pixels processed in parallel will roughly halve all the resources required (typically adaptive logic modules (ALMs), M20Ks, and digital signal processors (DSPs)). However, for a core such as a video scaler, the linestore memories required for a vertical filter are required once per IP core regardless of the duplicated structures. Averaged over a whole design this means that the savings made in ALMs and DSPs are not completely matched by savings in M20Ks. While usage of these memory blocks will reduce, it is not typically on the same scale as other resource types. This leads to the interesting observation that faster FPGAs will benefit from a slightly different ratio of resource types, with more M20Ks being most useful.

Real-World FPGA Clock Frequencies
Few FPGA designs ever run close to the theoretical maximum frequency of the part used. To achieve anything close to the maximum, the designer would have to manually tune the RTL, effectively hand placing every DSP, M20K and ALM. Historically it would also mean keeping registers close together to minimize routing delays; long routing lines and high fanouts were a common cause of reduced overall performance. This issue led to the introduction of Intel® Hyperflex™ FPGA Architecture routing in Intel® Stratix® 10 FPGA, which alleviated the routing bottleneck on clock frequency, although it often required extra pipelining registers to be added to the RTL.

Intel Agilex FPGAs include many changes intended to address common FPGA performance bottlenecks. The process gains of moving from 14 nm to 10 nm manufacturing have enabled DSPs, M20Ks, and general FPGA fabric to run much closer to the maximum frequency for wider use cases. For example, provided the systolic registers are enabled, using the chainin/chainout no longer causes the usable DSP frequency to decrease. It is now possible to run the DSPs at 676 MHz even in the slower -3V speed grade.

The second generation of Intel Hyperflex FPGA Architecture in Intel Agilex FPGAs includes an improved High-Speed Bypass Path, which improves default performance with RTL that is not suited to Intel Hyperflex FPGA Architecture, while the fine grain Hyper-Retiming allows the design tools to extract the fastest possible performance from the routing resources. These changes mean Intel Agilex FPGAs can deliver an up to 40% increase in core performance.

The overall effect of this is that many existing video IP cores originally designed to run at 300 MHz in an Intel Arria 10 FPGA can now comfortably run at 600 MHz in an Intel Agilex device with limited modification.

IP Core Resource Usage
The table shows the resources required for five video IP cores configured with either 2 or 1 pixel-in-parallel support, sufficient to handle “4K” resolution video with a processing clock of 300 MHz or 600 MHz respectively. Percentages give the 1 PIP figure as a proportion of the 2 PIP figure.

IP core                   ALMs             M20Ks        DSPs
3D LUT, 2 PIP             2,464            222          12
3D LUT, 1 PIP             1,326 (54%)      111 (50%)    6 (50%)
Color space, 2 PIP        711.4            0            12
Color space, 1 PIP        484.6 (69%)      0 (-)        6 (50%)
Scaler, 2 PIP             3,713.8          54           48
Scaler, 1 PIP             2,134.8 (58%)    50 (93%)     24 (50%)
Tone Mapper, 2 PIP        10,758           71           107
Tone Mapper, 1 PIP        7,504 (70%)      49 (70%)     58 (55%)
Warp, 2 PIP               9,550.0          477          72
Warp, 1 PIP               5,767.1 (61%)    347 (73%)    36 (50%)
Average size of 1 PIP IP core
compared against 2 PIP IP core
                          64%              68%          52%

Figures should be considered approximate and have been taken from the Intel® Quartus® Prime Pro Edition Software v21.2.

It can be clearly seen that doubling the processing clock frequency results in significant resource savings, particularly in ALM and DSP usage. When extrapolated over an entire design, such savings could easily mean a design fits in a smaller part or allows space for additional functionality.
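
The percentages in the IP Core Resource Usage table can be recomputed from the raw counts. The sketch below does this in Python; the dictionary and function names are hypothetical. The paper does not say how its "average" row was derived, so the pooled figure here (ratio of column sums) is only one assumed method. It lands within about one percentage point of the published 64% / 68% / 52%, and individual cells can likewise differ by a point from the table's rounded values.

```python
RESOURCES = ("ALMs", "M20Ks", "DSPs")

# (ALMs, M20Ks, DSPs) per IP core, copied from the table in this paper.
TWO_PIP = {
    "3D LUT": (2464.0, 222, 12),
    "Color space": (711.4, 0, 12),
    "Scaler": (3713.8, 54, 48),
    "Tone Mapper": (10758.0, 71, 107),
    "Warp": (9550.0, 477, 72),
}
ONE_PIP = {
    "3D LUT": (1326.0, 111, 6),
    "Color space": (484.6, 0, 6),
    "Scaler": (2134.8, 50, 24),
    "Tone Mapper": (7504.0, 49, 58),
    "Warp": (5767.1, 347, 36),
}


def per_core_percent(core):
    """1 PIP usage as a percentage of 2 PIP usage, per resource type.
    Returns None for a resource the 2 PIP core does not use at all."""
    return tuple(
        None if two == 0 else 100.0 * one / two
        for one, two in zip(ONE_PIP[core], TWO_PIP[core])
    )


def pooled_percent():
    """ASSUMPTION: average taken as the ratio of column sums over all cores."""
    totals = []
    for i in range(len(RESOURCES)):
        one_total = sum(row[i] for row in ONE_PIP.values())
        two_total = sum(row[i] for row in TWO_PIP.values())
        totals.append(100.0 * one_total / two_total)
    return tuple(totals)


def fmt(pct):
    return "-" if pct is None else f"{pct:.1f}%"


if __name__ == "__main__":
    for core in TWO_PIP:
        cells = ", ".join(
            f"{name} {fmt(pct)}"
            for name, pct in zip(RESOURCES, per_core_percent(core))
        )
        print(f"{core}: {cells}")
    print("Pooled:", ", ".join(fmt(p) for p in pooled_percent()))
```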


Power Savings
The dynamic power required by a single register switching
at 600 MHz is similar to two registers switching at 300 MHz
as can clearly be seen by creating entries in the Intel FPGA
Power and Thermal Calculator (PTC) tool.
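
A first-order way to see why these two cases draw similar dynamic power is the standard CMOS switching-power approximation. This is a textbook model, not the model used inside the PTC tool. The symbols are assumptions introduced here: \alpha is the activity factor, C the switched capacitance per register, and V the supply voltage, all taken to be identical in both cases.

```latex
P_{\mathrm{dyn}} = N \,\alpha\, C\, V^{2} f
\qquad\Longrightarrow\qquad
\underbrace{1 \cdot \alpha C V^{2} \cdot 600\ \mathrm{MHz}}_{\text{one register at 600 MHz}}
\;=\;
\underbrace{2 \cdot \alpha C V^{2} \cdot 300\ \mathrm{MHz}}_{\text{two registers at 300 MHz}}
```

In this idealized model the two are exactly equal. It ignores effects such as clock distribution and routing, which is consistent with the paper describing the real difference as only a slight increase in dynamic power at 600 MHz.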

The reduction in static power, however, often outweighs this increase in dynamic power, leading to a power advantage when switching to 600 MHz. While it is likely that the largest reduction in static power would be achieved by switching to a smaller FPGA, it is still possible to achieve meaningful static power reductions while remaining in the same part. For example, Intel Agilex devices support DSP and M20K power gating, so any resources saved in these areas will directly lead to static power reduction.

To demonstrate this, two test designs were constructed, each with four instances of the 3D LUT, Tone Mapper, and Warp IP cores. The two designs were configured for either 600 MHz 1 pixel-in-parallel (PIP) video data, or 300 MHz 2 PIP video data. The reduction in resource usage is shown in the following table.

Resource   300 MHz Variant           600 MHz Variant           600 MHz as % of 300 MHz
ALMs       94,289 (19% of device)    61,907 (13% of device)    65.7%
M20Ks      2,294 (32%)               1,642 (23%)               71.6%
DSPs       724 (16%)                 376 (8%)                  51.9%

The power usage figures estimated by the PTC tool, assuming a 25% toggle rate and a constant 85 °C junction temperature, are shown below. These figures clearly show an overall power reduction of around 0.5 W, with the 600 MHz design on the right.

[Figure: PTC tool power estimates for the 300 MHz and 600 MHz test designs]

These figures are taken from the PTC tool included with the Intel Quartus Prime Pro Edition Software v21.2 and should be considered approximate. Also, note that preliminary Intel Agilex FPGA power models have been used.

Wider System Improvements
Complete video systems typically include a wide range of functionality from other areas of technology. For example, they often rely on embedded processors or interconnect such as PCIe for control, and external DDR memory for storage. Improvements in FPGA clock frequencies also enable progress to be made in these areas.

The ability to support the latest external memory standards and speeds is increasingly linked to the FPGA fabric speed. The internal interface is usually clocked at a quarter of the memory clock frequency (or an eighth of the headline DDR figure). For example, a 64 bit DIMM of DDR4 3,200 MHz will result in an internal interface that is 512 bits wide and clocked at 400 MHz. Future support for DDR5 4,400 MHz memory in Intel Agilex 7 FPGA M-Series will require the FPGA fabric to support 550 MHz.

The same is true of the latest PCIe interfaces. For a number of years, designers have chosen to increase interface width rather than increase the clock frequency beyond 250 MHz. The ability of Intel Agilex FPGA fabric to comfortably meet 500 MHz effectively allows an upgrade from PCIe 4.0 to PCIe 5.0, or from 8 lanes to 16 lanes, without increasing the interface width and consuming more routing resources.
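
The external memory rule of thumb quoted in this section (internal interface clocked at a quarter of the memory clock, or an eighth of the headline DDR figure) can be checked with a short Python sketch. The class and function names are hypothetical, and the model assumes a simple quarter-rate controller in which total bandwidth is the same on both sides of the interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InternalInterface:
    width_bits: int
    clock_mhz: float


def internal_interface(ddr_rate: float, external_width_bits: int,
                       rate_ratio: int = 4) -> InternalInterface:
    """Fabric-side width and clock for an external DDR interface.

    ddr_rate is the headline DDR figure (e.g. 3200). The memory clock is
    half of that (two transfers per clock); the fabric runs at
    memory clock / rate_ratio, so the fabric bus must be
    2 * rate_ratio times wider to carry the same bandwidth.
    ASSUMPTION: quarter-rate operation (rate_ratio = 4), as in the text.
    """
    memory_clock_mhz = ddr_rate / 2
    return InternalInterface(
        width_bits=external_width_bits * 2 * rate_ratio,
        clock_mhz=memory_clock_mhz / rate_ratio,
    )


if __name__ == "__main__":
    print(internal_interface(3200, 64))   # the 64 bit DDR4 3,200 MHz example
    print(internal_interface(4400, 64))   # the DDR5 4,400 MHz example
```

For the two examples in the text this gives 512 bits at 400 MHz and a 550 MHz fabric clock respectively.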

Full System Case Study


We have recently considered the design of a warp solution
for a 4K120 projector, and specifically compared the options
of running in an Intel Arria 10 FPGA at 300 MHz, or an Intel
Agilex FPGA running at 600 MHz.
Control was to be handled by an embedded processor,
which would also compute the warp mesh required, a
mathematically complex task. For this reason, an SoC was
chosen and external communication would be via Ethernet.


[Figure: system block diagram. Blocks shown: Hard Processor System (I2C, GPIO) with Hard EMIF Controller and External DDR4 32bit; AXI Interface for Register Access; Timebase; Video Rx (1 input); Crosspoint; PreScale; Warp (2 or 4 engines); 3D LUT; TPG; OSD; Combiner; Video Tx (1 output); Video DMA with Hard EMIF Controller and External DDR4 32bit or 64bit; Reference XTAL]

The external video interfaces were specified to be HDMI at the input, and V-by-One at the output. To handle 4K120, V-by-One requires 16 transceiver lanes, each running at a fixed line rate of 3 Gbps. Genlock and alternative resolution video were to be handled by adjusting horizontal and vertical blanking periods as required, simplifying the clocking of the output video interface.

Internally, the warp processor is required to process 1,200*10^6 pixels per second. This requires two processing engines running at 600 MHz, or four processing engines running at 300 MHz.

[Figure: memory datapath. Warp, Video DMA and 32bit CPU Access connect through the system mm_interconnect (512bit@200MHz or 256bit@400MHz) to the Hard EMIF Controller (64bit@1600MHz or 32bit@3200MHz) and External DDR4 32bit or 64bit]

The memory bandwidth required for a 4K120 bounce through memory equates to 66.4 Gbps (3,840*2,250*120*32*2). The design also required a 1080p overlay to be read from external memory, adding 8.3 Gbps (1,920*1,125*32*120), for a total memory bandwidth of 74.7 Gbps. This was considered too high to be comfortably accommodated by a 32 bit DDR 2,400 MHz interface (76.8 Gbps). As a result, the Intel Arria 10 FPGA variant required a 64 bit memory interface at 1,600 MHz (102.4 Gbps), whereas the Intel Agilex FPGA could stick with a 32 bit interface and instead use faster 3,200 MHz memory (also 102.4 Gbps). The internal memory interface is therefore 512 bits at 200 MHz for Intel Arria 10 FPGA versus 256 bits at 400 MHz for Intel Agilex FPGA.

Resource Usage

                 300 MHz                     600 MHz
Module           ALMs      M20Ks    DSPs     ALMs      M20Ks    DSPs
PreScaler        7,735     58       96       3,733     54       48
Warp             16,590    750      144      9,502     477      72
3D LUT           2,464     222      12       1,326     111      6
Video DMA        2,699     36       0        2,405     21       0
VideoIO          5,000     20       0        5,000     20       0
Miscellaneous    6,000     10       6        5,000     6        3
Total            40,488    1,096    258      26,966    689      129

Figures for video I/O and miscellaneous are estimates for illustration.

Pin Count
The faster Intel Agilex FPGA I/O allows the use of a narrower DDR4 3,200 MHz memory interface. This saves 44 FPGA pins, which has several advantages, including reduced I/O power consumption, the potential for a smaller device package, fewer external memory components, and simpler PCB layout and routing.

Function                                                        I/O                              Transceiver (XCVR)
Clock / reset / HPS (flash, ethernet, uart)                     ~25
Video I/O                                                       ~10                              4 RX + 16 TX
CPU memory – DDR4 1,600 MHz * 32 bit                            75
Video memory – DDR4 1,600 MHz * 64 bit or 3,200 MHz * 32 bit    119 (64 bit) or 75 (32 bit)
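
The memory bandwidth arithmetic in this case study can be reproduced with a short Python sketch. The variable and function names are hypothetical; the raster sizes, bit depth and interface options are the ones quoted in the text. Note that the paper's 74.7 Gbps total is the sum of the two rounded figures, whereas the unrounded sum is closer to 74.65 Gbps.

```python
BITS_PER_PIXEL = 32
FPS = 120

# 4K120 "bounce" through memory: every pixel is written once and read once.
bounce_bps = 3840 * 2250 * FPS * BITS_PER_PIXEL * 2
# 1080p overlay: read only.
overlay_bps = 1920 * 1125 * BITS_PER_PIXEL * FPS
required_gbps = (bounce_bps + overlay_bps) / 1e9


def interface_gbps(ddr_rate: int, width_bits: int) -> float:
    """Peak bandwidth of a DDR interface: headline rate (MT/s) x bus width."""
    return ddr_rate * width_bits / 1000


# (headline DDR rate, external width in bits) for the options in the text.
OPTIONS = {
    "32 bit at 2,400 MHz": (2400, 32),
    "64 bit at 1,600 MHz": (1600, 64),
    "32 bit at 3,200 MHz": (3200, 32),
}

if __name__ == "__main__":
    print(f"Bounce:   {bounce_bps / 1e9:.1f} Gbps")
    print(f"Overlay:  {overlay_bps / 1e9:.1f} Gbps")
    print(f"Required: {required_gbps:.2f} Gbps")
    for name, (rate, width) in OPTIONS.items():
        peak = interface_gbps(rate, width)
        print(f"{name}: {peak:.1f} Gbps peak, "
              f"{100 * required_gbps / peak:.0f}% utilized")
```

The utilization column makes the design decision visible: the 2,400 MHz 32 bit option would need to sustain nearly all of its theoretical peak, while either 102.4 Gbps option leaves roughly a quarter of the peak as headroom.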


Device Selection
For the 300 MHz Intel Arria 10 FPGA variant, the number of M20Ks required means that the SX480 is the smallest possible device, even though this provides many more ALMs and pins than required. If the Intel Arria 10 FPGA family were capable of running these video IP cores at 600 MHz, then it would have been possible to move two devices smaller to the SX270 device, which is also available in a smaller package. This solution would have offered significant cost advantages and potentially power savings.
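
The device choice described above can be sketched as a simple capacity lookup in Python. The per-module M20K counts are copied from the Resource Usage table of the case study. The M20K capacities of the three Intel Arria 10 SX devices are not given in this paper; the values below are an assumption quoted from memory of the Arria 10 device overview and should be verified before use. The function name is hypothetical, and the sketch considers M20Ks only, ignoring ALMs, DSPs, pins and fitter headroom.

```python
# M20K usage per module (PreScaler, Warp, 3D LUT, Video DMA, VideoIO,
# Miscellaneous), copied from the Resource Usage table.
M20K_USAGE = {
    "300 MHz": [58, 750, 222, 36, 20, 10],
    "600 MHz": [54, 477, 111, 21, 20, 6],
}

# ASSUMPTION: M20K block counts per device, not taken from this paper.
ARRIA10_SX_M20KS = [("SX270", 750), ("SX320", 891), ("SX480", 1431)]


def smallest_device(m20ks_needed, devices=ARRIA10_SX_M20KS):
    """Return the name of the smallest device with enough M20K blocks,
    or None if no listed device is large enough."""
    for name, capacity in sorted(devices, key=lambda d: d[1]):
        if capacity >= m20ks_needed:
            return name
    return None


if __name__ == "__main__":
    for variant, usage in M20K_USAGE.items():
        total = sum(usage)
        print(f"{variant}: {total} M20Ks -> {smallest_device(total)}")
```

With these assumed capacities the 300 MHz design lands on the SX480 and the 600 MHz design on the SX270, two devices smaller, matching the paper's statement.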
Either design would easily fit in the AGF006, the smallest Intel Agilex device, but the smaller design, even though it runs faster, is still likely to offer a power reduction, as the reduction in static power usually outweighs the slight increase in dynamic power. Future “8K” video designs will likely come closer to filling Intel Agilex devices, in which case the resource saving offered by 600 MHz operation is again likely to mean selecting a smaller device.

Conclusions
• Intel Agilex FPGA is the first FPGA to comfortably reach
600 MHz, enabling integration of complex video systems
at this important clock frequency
• This performance is “push-button”, delivered with only
limited changes to RTL
• 600 MHz reduces resource count and increases value
by fitting into a smaller device, yielding significant static
power savings
• 8K video will drive future adoption of 600 MHz processing
as larger designs mean larger savings

Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates.
Some results have been estimated or simulated.
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

ISO 9001:2015 Registered


Version 1.2
