Webinar Rev Up Your Design Performance With PCI Express & DDR4 IP 20160216
송영규, Senior Manager
Staff FAE, Altera Korea
Module Game Plan
High-Performance DDR4 Interface
Introduce DDR4
Rev Up Your Design Performance with PCI Express & DDR4 IP
Agenda
Best-in-Class PCI Express IP – Hardened Protocol Stack
SR-IOV feature
− 2 Physical Functions (PFs) / 128 Virtual Functions (VFs)
− MSI / MSI-X interrupt support
− Expansion to 4 PFs / 4K VFs (1H’2016)
Performance & Productivity You Can Expect
Productivity
− Linux & Windows drivers available to expedite evaluation & design-in
− Device driver features
Multiple interface options to support various application models
− IOCTL, PIO, DMA
Character & block device driver support
3rd party, off-the-shelf, I/O benchmark tools can be used
− Iometer & fio
Open source code
License model is Dual BSD/GPL
Interface Comparisons
Configuration via Protocol (PCIe) Initialization: Introduction
[Figure: only the label “Periphery” survives]
Partial Reconfiguration over Protocol (PCIe) Introduction
Illustration of Programming Options
Options: CvP Initialization / CvP Update / PRoP
CvP Initialization
− t = 0 (board powers up): access to HIP (1)
− Programming of “Fabric” (Image A) will not happen until commanded by user
− Programming of “Fabric” (Image B) will not happen until commanded by user
Notes: 1. HIP is in L0, but transmission of packets should wait until fabric is loaded.
How Programming Flow Looks in Hardware
Options: CvP Initialization / CvP Update / PRoP
CvP Initialization
Timeline, starting at t = 0 (board powers up):
− Step #1: Program “Periphery”
− Steps #2, 2a: Program “Fabric” (Img A)
− Steps #3, 3a: Program “Fabric” (Img B)
Programming file is split: periphery & fabric
− Periphery file in Flash / EEPROM: programs periphery (including PCIe HIP block) at power-up
− 2a. Img A – programming file streamed across the PCIe I/O to program the fabric
− 3a. Img B – programming file streamed across the PCIe I/O to program the fabric
[Figure labels: PMA, PCS, Periphery]
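The ordering on this slide can be sketched as a small state model. This is only an illustration of the sequence described above (periphery first, fabric streamed later on user command, traffic held off until the fabric is loaded). The class, method, and state names are hypothetical and do not correspond to any Altera tool or driver API.

```python
from enum import Enum, auto


class State(Enum):
    POWER_UP = auto()
    PERIPHERY_LOADED = auto()  # periphery image loaded from flash/EEPROM
    FABRIC_LOADED = auto()     # a fabric image has been streamed over PCIe


class CvPDevice:
    """Toy model of the periphery-then-fabric programming order (hypothetical)."""

    def __init__(self):
        self.state = State.POWER_UP
        self.fabric_image = None

    def load_periphery(self):
        # Step 1: the periphery is programmed from local flash at power-up.
        if self.state is not State.POWER_UP:
            raise RuntimeError("periphery is only loaded at power-up")
        self.state = State.PERIPHERY_LOADED

    def stream_fabric(self, image_name):
        # Steps 2/3: the host streams a fabric image across the PCIe link,
        # which is only available once the periphery has been programmed.
        if self.state is State.POWER_UP:
            raise RuntimeError("PCIe link is not available before periphery load")
        self.fabric_image = image_name
        self.state = State.FABRIC_LOADED

    def can_send_packets(self):
        # Per the slide note, packet traffic waits until the fabric is loaded.
        return self.state is State.FABRIC_LOADED
```

In this model, calling `stream_fabric("B")` after `stream_fabric("A")` stands in for loading Img B after Img A. The slides do not give enough detail to model the differences between CvP Update and PRoP.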
PCIe-to-DDR4 Block Diagram
Legend: PCI Express design portion of bridge solution; DDR4 design portion of bridge solution
System Block Overview (AVMM with DMA)
[Figure: Qsys Module – PCIe AVMM with DMA. Blocks: DDR4 Memory, DDR4 Memory Controller (64-bit memory data path), DMA Engines (Wr Data Mover, Rd Data Mover), Descriptor Controller, hardened PCIe IP (Gen3 x8). Interfaces: Avalon Memory-Mapped (AVMM) 256-bit master and slave interfaces, AVST, Tx Slave, Rx Master. Numbered callouts 1–6.]
Descriptor Controller can be external to the Qsys Module. Enables user to insert their own custom descriptor controller.
− Directs the data movers by providing the destination, source, direction & amount of data
− An existing descriptor controller can be modified, or a custom one designed, to suit the application
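To make the descriptor controller's role concrete, the sketch below models a descriptor as the four items named above (source, destination, direction, amount of data) and splits one large transfer into several descriptors. The field names, the `"read"`/`"write"` labels, and the 4096-byte per-descriptor limit are illustrative assumptions, not the actual descriptor format of the Altera IP.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Descriptor:
    source: int       # source byte address
    destination: int  # destination byte address
    length: int       # amount of data in bytes
    direction: str    # which data mover handles it: "read" or "write" (illustrative labels)


def build_descriptors(source: int, destination: int, total_bytes: int,
                      direction: str, max_bytes: int = 4096) -> List[Descriptor]:
    """Split one transfer into descriptors of at most max_bytes each."""
    if direction not in ("read", "write"):
        raise ValueError("direction must be 'read' or 'write'")
    if total_bytes < 0 or max_bytes <= 0:
        raise ValueError("sizes must be positive")
    descriptors = []
    offset = 0
    while offset < total_bytes:
        chunk = min(max_bytes, total_bytes - offset)
        descriptors.append(
            Descriptor(source + offset, destination + offset, chunk, direction)
        )
        offset += chunk
    return descriptors
```

For example, a 10,000-byte transfer with the assumed 4096-byte limit yields three descriptors of 4096, 4096, and 1808 bytes. A custom descriptor controller could apply a different splitting policy.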
Qsys Integration of Snapshot
Agenda: Implementing DDR4
Overview
Implementation challenges
Addressing the challenges
Arria 10 memory interface architecture
Implementing Arria 10 DDR4 interfaces
Summary
DRAM Technology Comparison (1)
Feature                      DDR3                          DDR4
Voltage                      1.5 V / 1.35 V                1.2 V
DQ Bus                       SSTL15 CTT                    POD12
Strobe                       Bi-directional differential   Bi-directional differential
Strobe Configuration         Per byte                      Per byte
READ Data Capture            Strobe based                  Strobe based
Data Termination             VDDQ/2                        VDDQ
Address/Command Termination  VDDQ/2                        VDDQ/2
Burst Length                 BC4, 8                        BC4, 8
Number of Banks              8                             16
Bank Grouping                No                            4
On-Chip Error Detection      No                            Command / address parity; CRC for data bus
Configuration                x4, x8, x16                   x4, x8, x16
Package                      78-ball / 96-ball FBGA        78-ball / 96-ball FBGA
Data Rate (Mbps/Pin)         800 – 2,133                   1,600 – 3,200+
Component Density            1 Gb – 8 Gb                   2 Gb – 16 Gb
Stacking Options             DDP, QDP                      Up to 8H (128-Gb stack); single load
Notes: 1. Data from www.micron.com and www.jedec.org
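The per-pin data rates in the comparison translate into interface bandwidth as follows. This is a minimal sketch of the theoretical peak only, which ignores refresh, bus turnaround, and controller efficiency. The 64-bit width is chosen to match the memory data path shown earlier in the deck, and the function name is hypothetical.

```python
def peak_bandwidth_gbytes(data_rate_mbps: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    bits_per_second = data_rate_mbps * 1e6 * bus_width_bits
    return bits_per_second / 8 / 1e9


# Top listed data rates from the comparison, on a 64-bit data bus
ddr3_peak = peak_bandwidth_gbytes(2133, 64)  # about 17.1 GB/s
ddr4_peak = peak_bandwidth_gbytes(3200, 64)  # 25.6 GB/s
```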
DDR4 Power Savings Features
Creating a Data Valid Window
[Figure: data valid window; only the values 469 and 313 survive]
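The slide does not state units for the values 469 and 313. As a plausibility check, and only as an assumption, they match the unit interval (one bit time) in picoseconds at 2,133 Mbps and 3,200 Mbps, the top data rates listed for DDR3 and DDR4 in the comparison table. The helper names below are hypothetical.

```python
import math


def unit_interval_ps(data_rate_mbps: float) -> float:
    """Bit time in picoseconds for a per-pin data rate given in Mbps."""
    return 1e6 / data_rate_mbps


def round_half_up(value: float) -> int:
    # Python's built-in round() rounds halves to even, so 312.5 would give 312.
    return math.floor(value + 0.5)


ui_2133 = round_half_up(unit_interval_ps(2133))
ui_3200 = round_half_up(unit_interval_ps(3200))
```

Under this reading, the whole bit time shrinks by about a third between the two rates, before any of it is lost to jitter, skew, or setup and hold requirements.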
DDR4 JEDEC Definition of the Data Valid Window
Calibration Is Critical to Offset Shrinking Margins
No margin without calibration
Arria 10 Addresses the DDR4 Challenges
High-Level Output Topology
[Figure: CLK with X and X+90 phases feeding DQ OUT1 Delay and DQ OUT2 Delay to the DQ pin]
Calibration knobs
− DQ-out1 and DQ-out2 delay: Control the delay applied to outgoing DQ pins
− DQS-out1 and DQS-out2 delay: Control the delay applied to outgoing DQS pins
− Write leveling output: Changes the delay on both DQ and DQS relative to the memory clock, in phase taps
High-Level Input Topology
[Figure labels: DQS, DQS Enable, DQS IN Delay, DQS Delay Chain, DDIOin, DQ, DQ IN Delay, LFIFO; controls: dqs_en ptap, DQS en dtap, DQS in dtap, DQ in dtap, VFIFO, LFIFO]
Calibration knobs
− DQ-in delay: Control the delay applied to incoming DQ pins
− DQS-in delay: Control the delay applied to incoming DQS pins
− LFIFO: Controls number of cycles after read command that data is read out of the LFIFO
− DQS-En phase: Control the delay on DQS En in phase taps
− DQS-En delay: Control the delay on DQS En in dtaps
− VFIFO: Adjusts the delay in cycles applied to controller-provided DQS burst signal to generate DQS enable
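One common way to set a delay knob such as DQ-in delay is to sweep every tap, record which settings read back correctly, and park the knob in the middle of the widest passing range. The sketch below shows that generic technique. It is not the algorithm used by the Arria 10 hard calibration logic, which the slides do not describe, and `read_test` is a hypothetical callback that returns True when a read at the given tap passes.

```python
def center_of_widest_window(passes):
    """Return the centre tap of the longest run of passing taps, or None."""
    best_start, best_len = None, 0
    run_start = None
    # A trailing False acts as a sentinel that closes a run ending at the last tap.
    for tap, ok in enumerate(list(passes) + [False]):
        if ok and run_start is None:
            run_start = tap
        elif not ok and run_start is not None:
            run_len = tap - run_start
            if run_len > best_len:
                best_start, best_len = run_start, run_len
            run_start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2


def calibrate_delay(read_test, num_taps):
    """Sweep all taps with read_test(tap) and pick the centred setting."""
    results = [read_test(tap) for tap in range(num_taps)]
    return center_of_widest_window(results)
```

Centring in the widest window leaves roughly equal margin on both sides, which is the point of the "margin with calibration" comparison in this part of the deck.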
26
Calibration Stages
Flow
− Start
− Wait for PLL/DLL locking
− Initialize INST/AC ROM
− Initialize the memory (Mode Registers etc.)
− Calibration loop for all pins on this Mem Interface: calibrate the Mem Interface
Stages
DQS-enable calibration
− Calibrate DQS enable (delayed read data valid) relative to DQS
Post-amble tracking
− Track DQS-enable across temperature variation
Read data deskew
− Calibrate DQS relative to read command (read leveling)
− Calibrate DQ versus DQS (per-bit deskew) for reads
LFIFO training
Calibration done by hard IP
Calibration Is Critical to Shrinking Margins
Margin with calibration
Margin obtained via hard IP calibration
Arria 10 Memory Interface Overview
Arria 10 Innovative Memory Architecture
Increased flexibility
− Innovative architecture to mimic soft controller
− More configuration than a single controller
One controller in every IO Bank
− More widths (144-bit)
− More depths (multi-ranks)
Fast rollout
− DDR4 2,666 Mbps demo 4 weeks after silicon
− Maximum speed on all memory interfaces 12 weeks after silicon
− Guarantees schedule for software and characterization
Arria 10 I/O Sub-System
Organized in columns of
− I/O banks (groups of 48 I/Os)
− I/O Aux (one per column)
[Figure labels: HSSI, I/O Sub-System]
Arria 10 I/O Sub-System
Supports
− General Purpose I/Os (GPIO)
I/O Registers & I/O Buffers
− PLLs
IOPLL for EMIF and user logic
− External Memory Interfaces (EMIF)
Hard Memory Controller
Hard PHY
Hard MCU / Calibration logic
DLL
− On-chip Termination Control (OCT)
− LVDS or 2.5V/3V
Arria 10 I/O Banks
LVDS / DDR I/O
− Supports LVDS and single ended up to 1.8V
− LVDS pairs configurable as input or output
− Delivers highest performance
3V / DDR I/O
− Supports single ended up to 3V
− 3.3V input tolerant
− Limited to 533 MHz performance
JTAG interface
− Limited to 1.8V

I/O Standard     LVDS / DDR I/O   3V / DDR I/O
3V LVTTL/CMOS    N                Y
2.5V CMOS        N                Y
1.8V CMOS        Y                Y
1.5V CMOS        Y                Y
1.2V CMOS        Y                Y
LVDS             Y                N
Arria 10 EMIF Example – 72bit Interface (3 banks)
Additional Examples
− 1 x 8 pin (1 bank)
− 1 x 144 pin (6 banks)
[Figure: per-bank layouts showing Data and Addr/Cmd lanes and Controller placement, for Controller 1 and Controller 2]
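The bank counts quoted in these examples (72-bit in 3 banks, 8 in 1 bank, 144 in 6 banks) can be reproduced with simple lane arithmetic. The 48 I/Os per bank comes from the slides. The rest are assumptions made for this sketch: four 12-pin lanes per bank, one 8-bit data group per lane, three lanes reserved for address/command, and the reading of "1 x 8 pin" as an 8-bit interface. Real pin placement rules are more detailed than this.

```python
import math

PINS_PER_BANK = 48      # from the slides: I/O banks are groups of 48 I/Os
PINS_PER_LANE = 12      # assumption: each bank is split into four 12-pin lanes
LANES_PER_BANK = PINS_PER_BANK // PINS_PER_LANE
DQ_BITS_PER_LANE = 8    # assumption: one x8 data group per lane
ADDR_CMD_LANES = 3      # assumption: address/command occupies three lanes


def banks_needed(dq_width_bits: int) -> int:
    """Estimate the number of I/O banks for one interface of the given width."""
    data_lanes = math.ceil(dq_width_bits / DQ_BITS_PER_LANE)
    total_lanes = data_lanes + ADDR_CMD_LANES
    return math.ceil(total_lanes / LANES_PER_BANK)


bank_counts = {width: banks_needed(width) for width in (8, 72, 144)}
```

With these assumptions the estimate gives 1, 3, and 6 banks for widths of 8, 72, and 144 bits, in line with the examples.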
Key Controller Features: DDR4
Arria 10 EMIF IP: GUI & Generation
Selecting the Arria 10 Memory Interface IP
New EMIF IP for Arria 10 Devices
Design output:
− Generates IP in < 20s
− Generates well-documented IP
Each Protocol Has Its Own Set of Parameters
[Figure: protocol selection (DDR3, DDR4, RLD3, …) mapped to per-protocol parameter sets; parameter names shown include DIAGNOSTICS_CORE_ACCESS, MEM_DDR3_DQ_WIDTH, and MEM_RLD3_DQ_WIDTH]
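The parameter names on this slide suggest a naming pattern: protocol-specific parameters carry a `MEM_<PROTOCOL>_` prefix, while names such as DIAGNOSTICS_CORE_ACCESS have no protocol prefix. The sketch below filters a parameter set on that pattern. The pattern itself is inferred from the three names shown, and the values used in the example are made up for illustration.

```python
def protocol_param(protocol: str, suffix: str) -> str:
    """Build a protocol-specific parameter name, e.g. MEM_DDR3_DQ_WIDTH."""
    return f"MEM_{protocol.upper()}_{suffix.upper()}"


def params_for(protocol: str, params: dict) -> dict:
    """Keep shared parameters plus those belonging to the chosen protocol."""
    prefix = f"MEM_{protocol.upper()}_"
    selected = {}
    for name, value in params.items():
        is_protocol_specific = name.startswith("MEM_")
        if not is_protocol_specific or name.startswith(prefix):
            selected[name] = value
    return selected


# Illustrative values only; they are not taken from the slides.
example_params = {
    "DIAGNOSTICS_CORE_ACCESS": True,
    "MEM_DDR3_DQ_WIDTH": 72,
    "MEM_RLD3_DQ_WIDTH": 36,
}
ddr3_view = params_for("ddr3", example_params)
```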
Arria 10 EMIF Generation Output
Arria 10 EMIF IP core generates:
− Synthesis fileset for compilation (Altera EMIF IP core only)
− Simulation fileset for simulation (Altera EMIF IP core only)
− Example design for synthesis
Includes traffic generator
− Example design for simulation
Includes traffic generator and Altera memory model
Top-level IP is clear-text Verilog
− Optional VHDL wrapper
Includes scripts (QIP, SDC)
Readme “datasheet”
− Ports & parameters w/ descriptions
Example design is dynamic
− Matches IP configuration
[Figure: Example Testbench containing the Memory Model and the Example Design; the Example Design contains the Memory Driver (Traffic Generator) with a Pass/Fail output and the Arria 10 EMIF IP Core (Controller, AFI, PHY)]
Arria 10 EMIF IP: Simulation
EMIF Simulation Use Cases
Two main use cases for simulation:
1. System-level Simulation
− Purpose: Focus on user logic
− EMIF IP Core: Just needs to provide interface to store/retrieve data
Could be an abstracted model
Latency and efficiency of interface need to be accurate
User is not interested in details of core, nor calibration
− Simulation time expectation: Fast
2. EMIF-specific Simulation
− Purpose: Explore the details of EMIF IP core, calibration
− EMIF IP Core: Must be accurately represented versus hardware
User may want to see individual transactions
Visibility into calibration algorithm and stages may be required
This level of detail can be used to support debug
− Simulation time expectation: Slow
EMIF Simulation Use Cases: Arria 10 Solution
Arria 10 EMIF IP: Compilation & Timing Closure
Key Arria 10 EMIF Usability Enhancements
Understanding the Arria 10 EMIF Fitter
Arria 10 EMIF Timing Analysis
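[Figure: FPGA (FPGA Core, PHY, package traces) connected to Memory]
Timing paths
− C2C: core to core
− C2P: core to periphery
− P2C: periphery to core
− P2P: periphery to periphery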
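The four path labels follow from where a timing path starts and ends, taking "C" as the FPGA core and "P" as the periphery. A minimal sketch of that classification, with a hypothetical function name:

```python
def classify_path(source_in_core: bool, dest_in_core: bool) -> str:
    """Label a timing path as C2C, C2P, P2C, or P2P from its endpoints."""
    source = "C" if source_in_core else "P"
    dest = "C" if dest_in_core else "P"
    return f"{source}2{dest}"


# A path launched in user logic and captured in the periphery
label = classify_path(source_in_core=True, dest_in_core=False)
```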
TimeQuest DDR4 Timing Example
Summary
Hardened Memory Controller & PHY
Industry’s Fastest: DDR4 Support Up to 2666 Mbps
High performance hardened memory controller
− Built-in timing closure shortens engineering cycles
− Saves logic and memory resources
5k LEs and 29 M20K blocks per x72 DDR3 IF
− Up to x144 support
− Up to 4 x72 DDR3 interfaces on single device
Hardened memory controller supports
− DDR4, DDR3, LPDDR3
[Figure: Arria 10 FPGA with Core Fabric (User Design), AXI/Avalon IF, Memory Controller, and PHY Interface]