Embedded Design Workshop Student Guide
Student Guide
C6000 Embedded Design Workshop Student Guide, Rev 1.20 November 2013
Technical Training
Notice
Creation of derivative works unless agreed to in writing by the copyright owner is forbidden. No portion of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission from the copyright holder. Texas Instruments reserves the right to update this Guide to reflect the most current product information for the spectrum of users. If there are any differences between this Guide and a technical reference manual, references should always be made to the most current reference manual. Information contained in this publication is believed to be accurate and reliable. However, responsibility is assumed neither for its use nor any infringement of patents or rights of others that may result from its use. No license is granted by implication or otherwise under any patent or patent right of Texas Instruments or others.
Copyright 2013 by Texas Instruments Incorporated. All rights reserved. Technical Training Organization Semiconductor Group Texas Instruments Incorporated 7839 Churchill Way, MS 3984 Dallas, TX 75251-1903
Revision History
Rev 1.00 - Oct 2013 - Re-formatted labs/ppts to fit alongside new TI-RTOS Kernel workshop
Rev 1.10 - Oct 2013 - Added chapter 10 (Dyn Memory) as first optional chapter
Rev 1.20 - Nov 2013 - Upgraded all labs to use UIA/SA
Objectives
- Compare/contrast static and dynamic systems
- Define heaps and describe how to configure the different types of heaps (std, HeapBuf, etc.)
Module Topics
Using Dynamic Memory
  Module Topics
  Static vs. Dynamic
  Dynamic Memory Concepts
    Using Dynamic Memory
    Creating A Heap
  Different Types of Heaps
    HeapMem
    HeapBuf
    HeapMultiBuf
    Default System Heap
  Dynamic Module Creation
  Custom Section Placement
  Lab 10: Using Dynamic Memory
    Lab 10 Procedure - Using Dynamic Task/Sem
      Import Project
      Check Dynamic Memory Settings
      Inspect New Code in main()
      Delete the Semaphore and Add It Dynamically
      Build, Load, Run, Verify
      Delete Task and Add It Dynamically
  Additional Information
  Notes
    More Notes
Static Memory
Link Time:
- Allocate Buffers
Execute:
- Read data
- Process data
- Write data

Allocated at LINK time
+ Easy to manage (less thought/planning)
+ Smaller code size, faster startup
+ Deterministic, atomic (interrupts won't mess it up)
- Fixed allocation of memory resources
Optimal when most resources are needed concurrently

Dynamic Memory
Create:
- Allocate Buffers
Execute:
- R/W & Process
Delete:
- FREE Buffers

Allocated at RUN time
+ Limited resources are SHARED
+ Objects (buffers) can be freed back to the heap
+ Smaller RAM budget due to re-use
- Larger code size, more difficult to manage
- NOT deterministic, NOT atomic
Optimal when multiple threads share the same resource or memory needs are not known until runtime
[Figure: CPU with internal SRAM (stack and heap), data cache, and external memory reached through the EMIF]
- Common memory reuse within C language
- A heap (i.e. system memory) allocates, then frees, chunks of memory from a common system block
Code Example: Dynamic C Coding
#define SIZE 32
x = malloc(SIZE);        // SIZE is in MAUs
a = malloc(SIZE);        // SIZE is in MAUs
/* ... fill x and a with data ... */
filter();
free(x);
free(a);
- High-performance DSP users have traditionally used static embedded systems
- As DSPs and compilers have improved, the benefits of dynamic systems often allow enhanced flexibility (more threads) at lower costs
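The allocate, use, free pattern from the slide can be sketched in standard C as follows. This is a minimal illustration only: the helper name `sum_scratch` and the byte-filling step are assumptions standing in for the slide's `x={}` and `filter()` placeholders. The point is that every `malloc()` result is checked and every block is returned to the heap.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 32  /* buffer size in MAUs (bytes on most hosts), as on the slide */

/* Hypothetical helper: allocate two scratch buffers from the heap, fill them,
 * combine them, and give the memory back.
 * Returns the sum of all bytes, or -1 if the heap could not supply both buffers. */
int sum_scratch(size_t size, unsigned char fill_x, unsigned char fill_a)
{
    unsigned char *x;
    unsigned char *a;
    size_t i;
    int sum = -1;

    assert(size > 0);
    x = malloc(size);
    a = malloc(size);

    if (x != NULL && a != NULL) {
        memset(x, fill_x, size);   /* stands in for "fill x with data" */
        memset(a, fill_a, size);   /* stands in for "fill a with data" */
        sum = 0;
        for (i = 0; i < size; i++) {
            sum += x[i] + a[i];    /* stands in for filter() */
        }
    }

    free(x);  /* free(NULL) is a no-op, so the failure path is safe too */
    free(a);
    return sum;
}
```

Calling `sum_scratch(SIZE, 1, 2)` allocates 2 x 32 bytes, uses them, and frees them again, so the same heap space is available to the next caller.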
Multiple Heaps
Say, a big image array off-chip, and a fast scratch memory heap on-chip?
- BIOS enables multiple heaps to be created
- Create and name heaps in the .CFG file or via C code
- Use the Memory_alloc() function to allocate memory and specify which heap
[Figure: CPU with program and data cache; Heap in internal SRAM (alongside the stack) and Heap2 in external memory reached through the EMIF]
Memory_alloc()
Standard C syntax:
#define SIZE 32
x = malloc(SIZE);
a = malloc(SIZE);
/* ... fill x and a with data ... */
filter();
free(a);
free(x);

Using the Memory module:
x = Memory_alloc(NULL, SIZE, align, &eb);      // NULL = default heap; eb = Error Block (more details later)
a = Memory_alloc(myHeap, SIZE, align, &eb);    // custom heap
/* ... fill x and a with data ... */
filter();
Memory_free(NULL, x, SIZE);
Memory_free(myHeap, a, SIZE);                  // custom heap

Notes:
- malloc(size) API is translated to Memory_alloc(NULL,size,0,&eb) in SYS/BIOS
- Memory_calloc/valloc also available
Creating A Heap
1. Use HeapMem (Available Products)
Different Types of Heaps
- HeapMem
- HeapBuf
- HeapMultiBuf: specify variable-size blocks, but internally, allocate from a variety of fixed-size blocks

HeapMem
- Most flexible: allows allocation of variable-sized blocks (like malloc())
- Ideal when size of memory is not known until runtime
- Creation: .CFG (static) or C code (dynamic)
- Like malloc(), there are drawbacks:
  - NOT deterministic: the memory manager traverses a linked list to find blocks
HeapBuf
[Figure: HeapBuf_create() builds a HeapBuf made of fixed-size BUF blocks; HeapBuf_delete() removes it]
- Allows allocation of fixed-size blocks (no fragmentation)
- Deterministic, no reentrancy problems
- Ideal when using a varying number of fixed-size blocks (e.g. 4-6 buffers of 64 bytes each)
- Creation: .CFG (static) or C code (dynamic)
- For blockSize=64: ask for 16, get 64. Ask for 66, get NULL.
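To make the "ask for 16, get 64; ask for 66, get NULL" behavior concrete, here is a plain-C sketch of a fixed-size block pool. It is not the SYS/BIOS HeapBuf implementation: the names `BlockPool`, `pool_init`, `pool_alloc`, and `pool_free` are hypothetical, and the linear scan over an in-use table is an assumed simplification chosen for readability.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 64   /* every block has this fixed size */
#define NUM_BLOCKS 8    /* total blocks in the pool */

/* Hypothetical fixed-block pool: all storage is reserved up front. */
typedef struct {
    unsigned char storage[NUM_BLOCKS][BLOCK_SIZE];
    unsigned char in_use[NUM_BLOCKS];
} BlockPool;

void pool_init(BlockPool *p)
{
    size_t i;
    assert(p != NULL);
    for (i = 0; i < NUM_BLOCKS; i++) {
        p->in_use[i] = 0;
    }
}

/* Any request up to BLOCK_SIZE gets one whole block; larger requests fail. */
void *pool_alloc(BlockPool *p, size_t size)
{
    size_t i;
    assert(p != NULL);
    if (size > BLOCK_SIZE) {
        return NULL;                 /* e.g. ask for 66, get NULL */
    }
    for (i = 0; i < NUM_BLOCKS; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = 1;
            return p->storage[i];    /* e.g. ask for 16, get a 64-byte block */
        }
    }
    return NULL;                     /* pool exhausted */
}

/* Return a block to the pool; pointers that are not block starts are ignored. */
void pool_free(BlockPool *p, void *block)
{
    size_t i;
    assert(p != NULL);
    for (i = 0; i < NUM_BLOCKS; i++) {
        if (block == (void *)p->storage[i]) {
            p->in_use[i] = 0;
            return;
        }
    }
}
```

Because every block is the same size, a freed block can always satisfy the next request, which is why this style of heap cannot fragment. The work per call is bounded by NUM_BLOCKS, so the worst-case time is known in advance.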
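How do you create a HeapBuf?
Creating A HeapBuf
HeapBuf_Params_init(&prms);
prms.blockSize = 64;
prms.numBlocks = 8;
prms.buf = heapBufMem;     // backing memory supplied by the application (name is illustrative)
prms.bufSize = 512;        // at least numBlocks * blockSize = 8 * 64
myHeapBuf = HeapBuf_create(&prms, &eb);
buf1 = Memory_alloc(HeapBuf_Handle_upCast(myHeapBuf), 64, 0, &eb);
What if I need multiple sizes (16, 32, 128)?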
Multiple HeapBufs
[Figure: separate HeapBufs holding 16-byte, 32-byte, and 128-byte blocks]
- Given this configuration, what happens when we allocate the 9th 16-byte location from heapBuf1?
- What mechanism would you want to exist to avoid the NULL return pointer?
HeapMultiBuf
[Figure: a single HeapMultiBuf containing pools of 16-byte, 32-byte, and 128-byte blocks]
- Allows variable-size allocation from a variety of fixed-size blocks
- Services requests for ANY memory size, but always returns the most efficient-sized available block
- Can be configured to block borrow from the next size up
- Creation: .CFG (static) or C code (dynamic)
- Ask for 17, get 32. Ask for 36, get 128.
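The size-selection rule ("ask for 17, get 32; ask for 36, get 128") and block borrowing can be sketched in plain C as below. This only models which block size a request would be granted; the `SizeClass` type and `multibuf_alloc` function are hypothetical names, and the borrowing rule shown (try each larger pool in turn) is an assumption about the behavior, not the SYS/BIOS source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one pool of fixed-size blocks. */
typedef struct {
    size_t block_size;   /* size of every block in this pool */
    size_t free_blocks;  /* blocks still available */
} SizeClass;

/* Grant a block for `request` bytes from pools sorted by ascending block_size.
 * Returns the size of the block granted, or 0 if the request cannot be served.
 * With allow_borrow set, an empty best-fit pool falls back to larger pools. */
size_t multibuf_alloc(SizeClass pools[], size_t n, size_t request, int allow_borrow)
{
    size_t i;
    assert(pools != NULL);
    for (i = 0; i < n; i++) {
        if (pools[i].block_size < request) {
            continue;                      /* block too small, try next size up */
        }
        if (pools[i].free_blocks > 0) {
            pools[i].free_blocks--;
            return pools[i].block_size;    /* smallest block that fits */
        }
        if (!allow_borrow) {
            return 0;                      /* best-fit pool is empty */
        }
        /* borrowing enabled: keep looking in the larger pools */
    }
    return 0;
}
```

With pools of 16, 32, and 128 bytes, a request for 17 bytes skips the 16-byte pool and is served from the 32-byte pool. This is also one answer to the earlier question about the 9th 16-byte allocation: with borrowing on, it would be served from a larger pool instead of returning NULL.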
Default System Heap
- BIOS automatically creates a default system heap of type HeapMem
- How do you configure the default heap? In the .CFG GUI, of course.
[Screenshot: default heap settings in the .CFG GUI, including the align field]
Dynamic Module Creation
Module_create(params):
- Allocates memory for the object out of the heap
- Returns a Module_Handle to the created object
Module_delete():
- Frees the object's memory
Modules: Hwi, Swi, Task, Semaphore, Stream

Example: Semaphore creation/deletion
#define COUNT 0
Semaphore_Handle hMySem;
hMySem = Semaphore_create(COUNT, NULL, &eb);
Semaphore_post(hMySem);
Semaphore_delete(&hMySem);
Note: always check the return value of _create APIs!

Example: Task creation/deletion
Task_Params_init(&taskParams);
taskParams.priority = 3;
hMyTsk = Task_create(myCode, &taskParams, &eb);
// MyTsk now active w/priority = 3 ...
Task_delete(&hMyTsk);
taskParams includes: heap location, priority, stack ptr/size, environment ptr, name
- Most SYS/BIOS APIs that expect an error block also return a handle to the created object or allocated memory.
- If NULL is passed instead of an initialized Error_Block and an error occurs, the application aborts and the error can be output using System_printf(). This may be the best behavior in systems where an error is fatal and you do not want to do any error checking.
- The main advantage of passing and testing an Error_Block is that your program controls when it aborts.
- Typically, systems pass an Error_Block and check the returned resource pointer to see if it is NULL, then make a decision.
- You can also check the Error_Block using Error_check().
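The two behaviors described above (abort when no error block is passed, or report the failure and let the caller decide) can be modelled in plain C. This is a sketch of the pattern only, not the SYS/BIOS Error module: `ErrorBlock`, `error_init`, `error_check`, and `object_create` are hypothetical stand-ins, and the `heap_free` argument is an assumed way to simulate a heap that is too small.

```c
#include <assert.h>
#include <stdlib.h>

enum { ERR_NONE = 0, ERR_NO_MEMORY = 1 };

/* Hypothetical error block: records whether the last call failed. */
typedef struct {
    int code;
} ErrorBlock;

void error_init(ErrorBlock *eb)
{
    assert(eb != NULL);
    eb->code = ERR_NONE;
}

/* Non-zero if an error was recorded in the block. */
int error_check(const ErrorBlock *eb)
{
    assert(eb != NULL);
    return eb->code != ERR_NONE;
}

/* Hypothetical create call. `heap_free` simulates the space left in the heap.
 * With eb == NULL a failure aborts the program; otherwise the failure is
 * recorded in *eb and NULL is returned so the caller can decide what to do. */
void *object_create(size_t size, size_t heap_free, ErrorBlock *eb)
{
    void *obj;
    assert(size > 0);
    obj = (size <= heap_free) ? malloc(size) : NULL;
    if (obj == NULL) {
        if (eb == NULL) {
            abort();               /* caller opted out of error handling */
        }
        eb->code = ERR_NO_MEMORY;  /* caller opted in: report and return */
    }
    return obj;
}
```

A caller that passes an initialized error block checks the returned pointer (and, if it wants detail, the block) before using the object; a caller that passes NULL accepts that a failed creation ends the program.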
Custom Section Placement
Problem #1: You have a function or buffer that you want to place at a specific address in the memory map. How is this accomplished?
[Figure: myFxn is placed in section .myCode, linked into Mem1; myBuffer is placed in section .myBuf, linked into Mem2]
Problem #2: You have two buffers, and you want one to be linked at Ram1 and the other at Ram2. How do you split the .bss (compiler's default) section?
[Figure: .bss contains buf1 and buf2; buf1 is linked into Ram1 and buf2 into Ram2]
myFxn and myBuffer are the names of the function and variable; .myCode and .myBuf are the names of the custom sections.
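One way to put myFxn and myBuffer into the custom sections named above is the TI compiler's CODE_SECTION and DATA_SECTION pragmas. The sketch below is an illustration under assumptions: the buffer length and the body of myFxn are invented for the example, and the pragmas are guarded by the TI compiler's predefined `__TI_COMPILER_VERSION__` macro so that the same file still builds on a host compiler, where the pragmas simply do not apply.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_LEN 64   /* illustrative length, not from the slide */

/* In C, the pragma must appear before the symbol it names is declared. */
#ifdef __TI_COMPILER_VERSION__
#pragma DATA_SECTION(myBuffer, ".myBuf")
#endif
short myBuffer[BUF_LEN];

#ifdef __TI_COMPILER_VERSION__
#pragma CODE_SECTION(myFxn, ".myCode")
#endif
/* Illustrative function body: sums the first `len` entries of a buffer. */
int myFxn(const short *buf, size_t len)
{
    size_t i;
    int sum = 0;
    assert(buf != NULL);
    for (i = 0; i < len; i++) {
        sum += buf[i];
    }
    return sum;
}
```

The pragmas only name the sections. Where .myCode and .myBuf actually land in memory is decided by the SECTIONS directive of a linker command file, which is the subject of the next slide.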
[Figure: build flow. app.cfg and userlinker.cmd (MEMORY { } SECTIONS { }) feed the linker, which produces app.out and the .map file]
- Create your own linker.cmd file for custom sections
- CCS projects can have multiple linker CMD files
- May need to create custom MEMORY segments also (device-specific)
- ".bss:" is used as protection against a custom section not being linked
- -w warns if an unexpected section is encountered
Lab 10: Using Dynamic Memory
Procedure
- Import archived (.zip) project (from Task lab)
- Delete Task/Sem objects (for ledToggle)
- Write code to create Task/Sem dynamically
- Build, Play, Debug
- Use ROV/UIA to debug/analyze
[Figure: main() calls BIOS_start(); the Scheduler then runs the Hwi ISR, ledToggleTask, and Idle]
Hwi:
Semaphore_post(LedSem);
Task:
ledToggle() {
    while(1) {
        Semaphore_pend(LedSem, BIOS_WAIT_FOREVER);
        Toggle_LED;
    }
}
Time: 30 min
Import Project
1. Open CCS and make sure all existing projects are closed.
Close any open projects (right-click, Close Project) before moving on. With many main.c and app.cfg files floating around, it might be easy to get confused about WHICH file you are editing. Also, make sure all file windows are closed.
2. Import existing project from \Lab10.
Just like last time, the author has already created a project for you and it's contained in an archived .zip file in your lab folder. Import the following archive from your /Lab_10 folder: Lab_10_TARGET_STARTER_blink_Mem.zip
Click Finish. The project blink_TARGET_MEM should now be sitting in your Project Explorer. This is the SOLUTION of the earlier Task lab with a few modifications explained later. Expand the project to make sure the contents look correct.
3. Build, load and run the project to make sure it works properly.
We want to make sure the imported project runs fine before moving on. Because this is the solution from the previous lab, well, it should build and run. Build and fix any errors. Then run it and make sure it works. If all is well, move on to the next step. If you're having any difficulties, ask a neighbor for help.
Check the Runtime Memory Options and make sure the settings below are set properly for stack and heap sizes.
We need SOME heap to create the Semaphore and Task out of, so 256 is a decent number to start with. We will see if it is large enough as we go along. Save app.cfg.
The author also wants you to know that there is duplication of these numbers throughout the .cfg file, which causes some confusion, especially for new users. First, BIOS Runtime is THE place to change the stack and heap sizes. Other areas of the app.cfg file are followers of these numbers: they reflect these settings. Sometimes they are displayed correctly in other modules and some show zero. No worries, just use the BIOS Runtime numbers and ignore all the rest.
But, you need to see for yourself that these numbers actually show up in four places in the app.cfg file. Of course, BIOS Runtime is the first and ONLY place you should use. However, click on the following modules and see where these numbers show up (don't modify any numbers, just click and look):
- Hwi
- Memory
- Program
Yes, this can be confusing, but now you know. Just use BIOS Runtime and ignore the other locations for these settings.
Hint: If you change the stack or heap sizes in any of these other windows, it may result in a BIOS CFG warning of some kind. So, the author will say this one more time: ONLY use BIOS Runtime to change stack and heap sizes.
As you go through this lab, you will be uncommenting pieces of this code to create the Semaphore and Task dynamically, and you'll have to fill in the ???? with the proper names or values. Hey, we couldn't do ALL the work for you. Also notice in the global variable declaration area that two handles, for the Semaphore and the Task, are also provided. In order to use functions like Semaphore_create() and Task_create(), you will need to uncomment the necessary #include for the header files also.
So, in this example (C28x), the starting heap size was 0x100 (256) and 0xd0 (208) is still free, so the Semaphore object took 48 16-bit locations on the C28x (assuming nothing else is on the heap). OK. So, we didn't run out of heap. Good thing.
Write down how many bytes your Semaphore required here: _____________
How much free size do you have left over? ____________
So, when you create a Task, which has its own stack, if you create it with a stack larger than the free size left over, what might happen? _______________________________________________________
Well, let's go try it.
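The heap arithmetic in this step is simple enough to check in code. The sketch below uses the C28x numbers quoted above (0x100 total, 0xd0 free); the function names are hypothetical, and the "fits" test is an assumed simplification that ignores any per-allocation overhead a real heap may add.

```c
#include <assert.h>
#include <stddef.h>

/* Space consumed so far: total heap size minus what is still free. */
size_t heap_used(size_t total, size_t free_now)
{
    assert(free_now <= total);
    return total - free_now;
}

/* Simplified check: can a request of this size possibly be satisfied? */
int fits_in_heap(size_t request, size_t free_now)
{
    return request <= free_now;
}
```

With the values from the text, `heap_used(0x100, 0xd0)` is 48. A Task stack larger than the remaining 208 locations fails the `fits_in_heap` check, which is the situation the next steps demonstrate.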
What happened? Two things. First, your heap is not big enough to create a Task from, because the Task requires a stack that is larger than the entire heap. Also, did you pass an error block in the Task_create() function? Probably not. So, what happens if you get a NULL pointer back and you do NOT pass an error block? BIOS aborts. Well, that's what it looks like.
13. Open ROV to see the damage.
Open ROV and click on Task. You should see something similar to this:
Look at the size of stackSize for ledToggle (name may or may not show up). This screen capture was for C28x, so your size may be different (probably larger).
What size did you set the heap to in BIOS Runtime? __________ bytes
What is the size of the stack needed for ledToggle (shown in ROV)? __________ bytes
Get the picture? You need to increase the size of the heap.
14. Go back and increase the size of the heap.
Open BIOS Runtime and use the following heap sizes:
- C28x: 1024
- C6000: 4096
- MSP430: 1024
- TM4C: 4096
We probably don't need THIS large of a heap for this application. It could be tuned better; we're just using a larger number to see the application work. Save app.cfg.
15. Wait, what about Error Block?
In a real application, the user has a choice whether to use Error Block or not. For debug purposes, maybe it is best to leave it off so that your program aborts when the handle to the requested resource is NULL. If you don't like that, then use Error Block, check the return handle, and deal with it however you choose (user preference). In our lab, we chose to ignore Error Block, but at least you know it is there, how to initialize one and how it works.
16. Rebuild and run again.
Rebuild and run the new project with the larger heap. Run for 5 blinks. It should work fine now.
17. Terminate your debug session, close the project and close CCS.
You're finished with this lab. Help a neighbor who is struggling: you know you KNOW IT when you can help someone else, and it's being a good neighbor. But, if you want to be selfish and just leave the room because the workshop is OVER, no one will look at you funny!!
Additional Information
Placing a Specific Section into Memory
Via the Platform File (C6000 only). High-level, but works fine:
- GUI
- CFG script
C6000 Introduction
Introduction
This is the first chapter that specifically addresses ONLY the C6000 architecture. All chapters from here on assume the student has already taken the 2-day TI-RTOS Kernel workshop. During those past two days, some specific C6000 architecture items were skipped in favor of covering all TI EP processors with the same focus. Now, it is time to dive deeper into the C6000 specifics.
The first part of this chapter focuses on the C6000 family of devices. The 2nd part dives deeper into topics already discussed in the previous two days of the TI-RTOS Kernel workshop. In a way, this chapter is catching up all the C6000 users so that they understand this target environment specifically.
After this chapter, we plan to dive even deeper into specific parts of the architecture like optimizations, cache and EDMA.
Objectives
- Introduce the C6000 Core and the C6748 target device
- […] discussions are C6000-specific topics such as Interrupts, Platforms and Target Config Files
- […] an Hwi to respond to the audio interrupts
Module Topics
C6000 Introduction
  Module Topics
  TI EP Product Portfolio
  DSP Core
  Devices & Documentation
  Peripherals
    PRU
    SCR / EDMA3
    Pin Muxing
  Example Device: C6748 DSP
  Choosing a Device
  C6000 Arch Catchup
    C64x+ Interrupts
    Event Combiner
    Target Config Files
    Creating Custom Platforms
  Quiz
    Quiz - Answers
  Using Double Buffers
  Lab 11: An Hwi-Based Audio System
    Lab 11 - Procedure
      Hack LogicPD's BSL types.h
    PART B (Optional) - Using the Profiler Clock
  Additional Information
  Notes
TI EP Product Portfolio
TI's Embedded Processor Portfolio: Microcontrollers (MCU) and Application (MPU)

MSP430 (16-bit, Ultra Low Power & Cost)
- MSP430 ULP RISC MCU
- Low Pwr Mode: 0.1 µA; 0.5 µA (RTC)
- Analog I/F, RF430
- Software: TI RTOS (SYS/BIOS)
- Memory: Flash 512K, FRAM 64K
- Speed: 25 MHz
- Price: $0.25 to $9.00

C2000 (32-bit, Real-time)
- Motor Control, Digital Power, Precision Timers/PWM
- Software: TI RTOS (SYS/BIOS)
- Memory: 512K Flash
- Speed: 300 MHz
- Price: $1.85 to $20.00

Tiva-C (32-bit, All-around MCU)
- 32-bit Float, Nested Vector Int Ctrl (NVIC), Ethernet (MAC+PHY)
- Software: TI RTOS (SYS/BIOS)
- Memory: 512K Flash
- Speed: 80 MHz
- Price: $1.00 to $8.00

Hercules (32-bit, Safety)
- ARM Cortex-M3, Cortex-R4
- Lock step Dual-core R4, ECC Memory, SIL3 Certified
- Software: N/A
- Memory: 256K to 3M Flash
- Speed: 220 MHz
- Price: $5.00 to $30.00

Sitara (32-bit, Linux/Android)
- ARM Cortex-A8, Cortex-A9
- $5 Linux CPU, 3D Graphics, PRU-ICSS industrial subsys
- Software: Linux, Android, SYS/BIOS
- Memory: L1: 32K x 2, L2: 256K
- Speed: 1.35 GHz
- Price: $5.00 to $25.00

DSP (16/32-bit, All-around DSP)
- DSP C5000, C6000
- C5000 Low Power DSP; 32-bit fix/float C6000 DSP
- Software: C5x: DSP/BIOS; C6x: SYS/BIOS
- Memory: L1: 32K x 2, L2: 256K
- Speed: 800 MHz
- Price: $2.00 to $25.00

Multicore (32-bit, Massive Performance)
- C66 + C66, A15 + C66, A8 + C64, ARM9 + C674
- Fix or Float
- Up to 12 cores (4 A15 + 8 C66x)
- DSP MMACs: 352,000
DSP Core
What Problem Are We Trying To Solve?
[Figure: signal chain. ADC produces x, the DSP computes Y, and Y goes to the DAC]
Y = \sum_{i=1}^{t} \mathrm{coeff}_i \cdot x_i
- C6x Compiler excels at Natural C
- Multiplier (.M) and ALU (.L) provide up to 8 MACs/cycle (8x8 or 16x16)
- Specialized instructions accelerate intensive, non-MAC oriented calculations. Examples include: Video compression, Machine Vision, Reed Solomon
- While MMACs speed math-intensive algorithms, the flexibility of 8 independent functional units allows the compiler to quickly perform other types of processing
- C6x CPU can dispatch up to eight parallel instructions each cycle
- All C6x instructions are conditional, allowing efficient hardware pipelining
[Figure: C6x CPU with functional units .S1, .S2, .M1, .M2 (MACs), .L1, .L2, register files up to A31 and B31, and the Controller/Decoder]
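The sum of products that the slide identifies as the core problem is written in "natural C" as a simple loop, which is the form the compiler is expected to map onto the multiply and ALU units. A minimal sketch follows; the function name `dot_product` and the choice of 16-bit inputs with a 32-bit accumulator are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Y = sum over i of coeff[i] * x[i], one multiply-accumulate (MAC) per tap.
 * 16-bit inputs are widened so each product is formed in 32 bits. */
int32_t dot_product(const int16_t *coeff, const int16_t *x, size_t count)
{
    size_t i;
    int32_t y = 0;
    assert(coeff != NULL && x != NULL);
    for (i = 0; i < count; i++) {
        y += (int32_t)coeff[i] * x[i];
    }
    return y;
}
```

For coefficients {1, 2, 3, 4} and samples {10, 20, 30, 40} the result is 10 + 40 + 90 + 160 = 300. The accumulator can overflow for long filters with large values, so a real design has to budget headroom.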
[Figure: C6000 core roadmap showing C62x, C621x, C64x, C64x+, C67x, C671x, C67x+, C674, and C66x]
- C64x: 1GHz, EDMA (v2), 2x Register Set, SIMD Instrs (Packed Data Proc); Video/Imaging Enhanced, EDMA2
- C64x+: L1 RAM and/or Cache, Timestamp Counter, Compact Instrs, Exceptions, Supervisor/User modes
- C674: Combined Instr Sets from C64x+/C67x+, Incr Floating-pt MHz, Lower power, EDMA3, PRU
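Devices & Documentation
[Table: devices by core generation. Devices shown: C672x, DM643x, C645x, C647x, DM64xx, OMAP35x, DM37x, C6748, OMAP-L138*, C6A8168, C665x (new), C667x, Future]
Literature numbers shown:
- C674: SPRUFE8, SPRUFK5, SPRUFK9, SPRUG82
- C66x: SPRUGH7, SPRUGW0, N/A, SPRUGY8
- Also listed: SPRA198, SPRAB27, SPRU198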
To find a manual, go to www.ti.com and enter the document number in the Keyword field, or use www.ti.com/lit/<litnum>:
- SPRUEX3 - SYS/BIOS (v6) User's Guide
Code Generation Tools:
- SPRU186 - Assembly Language Tools User's Guide
- SPRU187 - Optimizing C Compiler User's Guide
Peripherals
[Figure: device block diagram with ARM, Graphics Accelerator, PRU, EDMA3, SCR, and the peripherals below]
- Serial: McBSP, McASP, ASP, UART, SPI, I2C, CAN
- Storage: DDR2, DDR3, SDRAM, Async, SD/MMC, ATA/CF, SATA
- Master: PCIe, EMAC, uPP, HPI
- GPIO
- Timing: Timers, PWM, eCAP, RTC
- Video/Display Subsystem: Capture, Analog Display, Digital Display, LCD Controller
PRU
Programmable Realtime Unit (PRU)
PRU consists of:
- 2 independent, realtime RISC cores
- Access to pins (GPIO)
- Its own interrupt controller
- Access to memory (master via SCR)
- Device power mgmt control (ARM/DSP clock gating)
Use as a soft peripheral to implement additional on-chip peripherals. Example implementations include: […]
- Create custom peripherals or set up non-linear DMA moves
- No C compiler (ASM only)
Implement a smart power controller:
- Allows switching off both ARM and DSP clocks
- Maximize power-down time by evaluating system events before waking up DSP and/or ARM
SCR / EDMA3
System Architecture: SCR/EDMA
- SCR = Switched Central Resource
- Masters initiate accesses to/from slaves via the SCR
- Most Masters (requestors) and Slaves (resources) have their own port to the SCR
- Lower bandwidth masters (HPI, PCI66, etc.) share a port
- There is a default priority (0 to 7) to SCR resources that can be modified
[Figure: Masters (ARM, DSP, EDMA3 with CC and TC0-TC2) connect through the SCR to Slaves (C64 Mem, DDR2, EMIF64, TCP, VCP); PCI, McBSP, and Utopia are also attached]
Note: this picture is the general idea. Every device has a different scheme for SCRs and peripheral muxing. In other words, check your data sheet.
Pin Muxing
What is Pin Multiplexing?
- How many pins are on your device?
- How many pins would all your peripherals require?
- Pin multiplexing is the answer: only so many peripherals can be used at the same time. In other words, to reduce costs, peripherals must share available pins.
- Which ones can you use simultaneously?
[Figure: Pin Mux Example, with HPI and uPP sharing pins]
- Designers examine app use cases when deciding the best muxing layout
- Read the datasheet for final authority on how pins are muxed
- A graphical utility can assist with figuring out pin-muxing
Graphical Utilities for Determining Which Peripherals Can Be Used Simultaneously
- Provides Pin Mux Register Configurations
- Warns user about conflicts
- ARM-based devices: www.ti.com/tool/pinmuxtool
- Others: see product page
Example Device: C6748 DSP
TMS320C6748
Performance & Memory:
- Up to 456MHz
- 256K L2 (cache/SRAM)
- 32K L1P/D Cache/SRAM
- 16-bit DDR2-266
- 16-bit EMIF (NAND Flash)
- 64-Channel EDMA 3.0
Communications:
- 10/100 EMAC
- USB 1.1 & 2.0
- SATA
[Figure: C6748 block diagram with the C674x+ DSP core, 32KB L1P Cache/SRAM, 256K L2, EDMA3, PLL (4-32x), uPP, and 128- and 256-bit buses; labels "Pin-to-pin" and "PBGA"]
Choosing a Device
DSP & ARM MPU Selection Tool
http://focus.ti.com/en/multimedia/flash/selection_tools/dsp/dsp.html
C64x+ Interrupts
[Figure: interrupt flow. Interrupt sources 0 to 127 (124+4, e.g. MCASP0_INT) pass through the Interrupt Selector to the 12 CPU interrupts, then IFR, IER, GIE, CPU Acknowledge, and the Vector Table]
- #2 Interrupt Selector: choose which 12 of 128 interrupt sources to use
- #4 Interrupt Enable Register (IER): individually enable the proper interrupt sources
- #5 Global Interrupt Enable (GIE/NMIE): globally enable all interrupts
- #6 CPU Acknowledge

C6748 has 128 possible interrupt sources (but only 12 CPU interrupts).
4-Step Programming:
1. Interrupt Selector: choose which of the 128 sources are tied to the 12 CPU ints
2. IER: enable the individual interrupts that you want to listen to (in BIOS .cfg)
3. GIE: enable global interrupts (turned on automatically if BIOS is used)
4. Note: the HWI Dispatcher performs smart context save/restore (automatic for BIOS Hwi)
Note: NMIE must also be enabled. BIOS automatically sets NMIE=1. If BIOS is NOT used, the user must turn on both GIE and NMIE manually.
Event Combiner
- ECM combines multiple events (e.g. 4-31) into one event (e.g. EVT0)
- EVTx ISR must parse MEVTFLAG to determine which event occurred
[Figure: Event Combiner. For each group n = 0 to 3 (EVT 4-31, EVT 32-63, EVT 64-95, EVT 96-127), EVTFLAG[n] ("Occur?") and EVTMASK[n] ("Care?") produce MEVTFLAG[n] ("Both Yes?"). The combined events, along with EVT 4-127 directly, feed the Interrupt Selector and then the CPU]
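The "Occur? / Care? / Both yes?" logic of the event combiner, and the parsing an EVTx ISR has to do, can be sketched as follows. This is a model of the slide's logic only: `care_mask` is a hypothetical parameter in which a set bit means "we care about this event", and the bit polarity of the actual EVTMASK register should be checked in the device reference guide before writing real code.

```c
#include <assert.h>
#include <stdint.h>

/* An event reaches the combined output only if it occurred AND we care about it. */
uint32_t masked_event_flags(uint32_t evtflag, uint32_t care_mask)
{
    return evtflag & care_mask;
}

/* Parse a masked flag word for one group (0..3, 32 events each).
 * Returns the lowest pending event number, or -1 if none is pending. */
int next_event(uint32_t mevtflag, unsigned group)
{
    unsigned bit;
    assert(group < 4);
    for (bit = 0; bit < 32; bit++) {
        if (mevtflag & ((uint32_t)1 << bit)) {
            return (int)(group * 32 + bit);
        }
    }
    return -1;
}
```

An ISR attached to a combined event would loop on `next_event()`, service each event it finds, and clear that event's flag until nothing is pending. The clearing step is omitted from this model.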
Target Config Files
- Target Configuration defines your target, i.e. emulator/device used, GEL scripts (replaces the old CCS Setup)
- Create user-defined configurations (select based on chosen board)
[Screenshot: Target Configuration editor, Advanced tab]
A GEL file is basically a batch file that sets up the CCS debug environment, including:
- Memory Map
- Watchdog
- UART
- Other periphs
The board manufacturer (e.g. SD or LogicPD) supplies GEL files with each board. To create a stand-alone or bootable system, the user must write code to perform these actions (an optional chapter covers these details).
Creating Custom Platforms
Most users will want to create their own custom platform package (Stellaris/C28x users maybe not; they will use a .cmd file directly). Here is the process:
1. Create a new platform package
2. Select repository, add to project path, select device
3. Import the existing seed platform
4. Modify settings
5. [Save] creates a custom platform pkg
6. Build Options: select new custom platform
- Custom Repository vs. XDC default location
- "Add Repository to Path" adds the platform path to the project path
Quiz
Chapter Quiz
1. How many functional units does the C6000 CPU have?
2. What is the size of a C6000 instruction word?
3. What is the name of the main bus arbiter in the architecture?
4. What is the main difference between a bus master and slave?
5. Fill in the names of the following blocks of memory and bus:
[Figure: block diagram with blank boxes around the CPU, connected by 128-bit and 256-bit buses]
Quiz - Answers
3. What is the name of the main bus arbiter in the architecture?
SCR (Switched Central Resource)
4. What is the main difference between a bus master and slave?
Masters can initiate a memory transfer (e.g. EDMA, CPU)
5. [Figure: block diagram labeled SCR, L2, L1D, and CPU, with 256-bit and 128-bit buses]
Using Double Buffers
Double buffer system: process and collect data, real-time compliant!
[Figure: the Hwi fills BUF y while the Swi/Task processes BUF x, then the roles swap]
- One buffer can be processed while another is being collected
- When the Swi/Task finishes a buffer, it is returned to the Hwi
- Task is now caught up and meeting real-time expectations
- Hwi must have priority over Swi/Task to get new data while prior data is being processed (standard in SYS/BIOS)
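The hand-off between the collecting side (Hwi) and the processing side (Swi/Task) can be sketched in plain C. This is a single-threaded model of the bookkeeping only, with hypothetical names (`DoubleBuffer`, `db_init`, `db_put`) and a deliberately tiny BUFFSIZE; it does not show the interrupt or the semaphore that would signal the task in a real system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUFFSIZE 4   /* illustrative block size, kept small for clarity */

typedef struct {
    int16_t buf[2][BUFFSIZE];  /* buf[0] = ping, buf[1] = pong */
    int     fill;              /* which buffer the collector is filling */
    size_t  count;             /* samples collected so far in buf[fill] */
} DoubleBuffer;

void db_init(DoubleBuffer *db)
{
    assert(db != NULL);
    db->fill = 0;
    db->count = 0;
}

/* Collector side (the Hwi's job): store one sample.
 * Returns a pointer to the buffer that just became full, which is now safe
 * to process while the other buffer fills, or NULL if no block is ready yet. */
const int16_t *db_put(DoubleBuffer *db, int16_t sample)
{
    const int16_t *full;
    assert(db != NULL);
    db->buf[db->fill][db->count++] = sample;
    if (db->count < BUFFSIZE) {
        return NULL;
    }
    full = db->buf[db->fill];
    db->fill ^= 1;     /* swap ping and pong */
    db->count = 0;
    return full;
}
```

The processing side must finish with the returned buffer before the collector fills the other one and swaps back. That deadline, BUFFSIZE sample periods, is the real-time budget for the Swi/Task.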
Lab 11: An Hwi-Based Audio System
Application: Audio pass-thru using Hwi and McASP/AIC3106
Key Ideas: Hwi creation, Hwi conditions to trigger an interrupt, Ping-Pong memory management
Pseudo Code:
- main(): init BSL, init LED, return to BIOS scheduler
- isrAudio(): responds to McASP interrupt, reads data from RCV XBUF and puts it in the RCV buffer, acquires data from the XMT buffer, writes to XBUF. When the buffer is full, copy RCV to XMT buffer. Repeat.
- FIR_process(): memcpy RCV to XMT buffer. Dummy algo for FIR later on
[Figure: Audio Input (48 KHz) goes to the ADC (AIC3106), then through the McASP (XBUF12) to the Hwi, which fills the RCV double buffer (rcvPing), copies RCV to XMT (xmtPing), and writes through the McASP (XBUF11) to the DAC (AIC3106) and Audio Output (48 KHz). Source files: mcasp.c, aic3106.c]
Hwi:
datIn = XBUF12
pIn[cnt] = datIn
datOut = pOut[cnt]
XBUF11 = datOut
if (cnt >= BUF) {
    Copy RCV -> XMT
}
Procedure
1. Import existing project (Lab11)
2. Create your own CUSTOM PLATFORM
3. Config Hwi to respond to McASP interrupt
4. Debug Interrupt Problems
Time = 45min
Lab 11 Procedure
If you can't remember how to perform some of these steps, please refer back to the previous labs for help. Or, if you really get stuck, ask your neighbor. If you AND your neighbor are stuck, then ask the instructor (who is probably doing absolutely NOTHING important) for help.
This starter file contains all the starting source files for the audio project including the setup code for the A/D and D/A on the OMAP-L138 target board. It also has UIA activated. 3. Check the Properties to ensure you are using the latest XDC, BIOS and UIA. For every imported project in this workshop, ALWAYS check to make sure the latest tools (XDC, BIOS and UIA) are being used. The author created these projects at time x and you may have updated the tools on your student PC at x+1 some time later. The author used the tools available at time x to create the starter projects and solutions which may or may not match YOUR current set of tools. Therefore, you may be importing a project that is NOT using the latest versions of the tools (XDC, BIOS, UIA) or the compiler. Check ALL settings for the Properties of the project (XDC, BIOS, UIA) and the compiler and update the imported project to the latest tools before moving on and save all settings.
Close types.h. Now that this file is hacked, you will be able to use Logic PD's types.h for all future labs without a ton of warnings when you build.
Several source files are needed to create this application. Let's explore those briefly.
Notice that we have separate buffers for Ping and Pong for both RCV and XMT. Where is BUFFSIZE defined? Main.h. We'll see it in a minute. As you go into main(), you'll see the zeroing of the buffers to provide initial conditions of ZERO. Think about this for a minute. Is that ok? Well, it depends on your system. If BUFFSIZE is 256, that means 256 ZEROs will be transmitted to the DAC during the first 256 interrupts. What will that sound like? Do we care? Some systems require solid initial conditions, so keep that in mind. We will just live with the zeros for now. Then, you'll see the calls to the init routines for the McASP and AIC3106. Previously, with DSP/BIOS, this is where an explicit call to init interrupts was located. However, with SYS/BIOS, this is done via the GUI. Lastly, there is a call to McASP_Start(). This is where the McASP is taken out of reset, the clocks start operating and data starts being shifted in/out. Soon thereafter, we will get the first interrupt.
8. Open mcasp_TTO.c for editing. This file is responsible for initializing and starting the McASP, hence two functions (init and start). In particular, look at line numbers 83 and 84 (approximately). This is where the serializers are chosen. This specifies XBUF11 (XMT) and XBUF12 (RCV). Also, look at line numbers 111-114. This is where the McASP interrupts are enabled. So, if they are enabled correctly, we should get these interrupts to fire to the CPU.
9. Open isr.c for editing. Well, this is where all the real work happens: inside the ISR. This code should look pretty familiar to you already. There are 3 key concepts to understand in this code:
- Ping/Pong buffer management: notice that two local pointers are used to point to the RCV/XMT buffers. This was done as a precursor to future labs but works just fine here too. Notice at the top of the function that the pointers are initialized only if blkCnt is zero (i.e. it is time to switch from ping to pong buffers or vice versa) and we're done with the previous block. blkCnt is used as an index into the buffers.
- McASP reads/writes: refer to the read/write code in the middle. When an interrupt occurs, we don't know if it was the RRDY (RCV) or XRDY (XMT) bit that triggered the interrupt. We must first test those bits, then perform the proper read or write accordingly. On EVERY interrupt, we EITHER read one sample or write one sample. All McASP reads and writes are 32 bits. Period. Even if your word length is 16 bits (like ours is). Because we are MSB first, the 16 bits of interest land in the UPPER half of the 32 bits. We turned on ROR (rotate-right) of 16 bits on rcv/xmt to make our code look more readable (and save time vs. >> 16 via the compiler).
- At the end of the block, what happens? Look at the bottom of the code. When BUFFSIZE is reached, blkCnt is zeroed and the pingPong Boolean switches. Then, a call to FIR_process() is made that simply copies the RCV buffer to the XMT buffer. Then, the process happens all over again for the other (PING or PONG) buffers.
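The three concepts in step 9 can be sketched in plain C. This is only an illustration, not the lab's isr.c: the sim_* variables are hypothetical stand-ins for the McASP RRDY/XRDY flags and XBUF registers, BUFFSIZE is shrunk to 4, and the names isrAudio_sketch() and FIR_process_sketch() are made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BUFFSIZE 4              /* tiny block for illustration; the lab defines BUFFSIZE in main.h */

/* Hypothetical stand-ins for the McASP ready flags and 32-bit XBUF registers. */
bool     sim_rrdy, sim_xrdy;
uint32_t sim_xbuf_rcv, sim_xbuf_xmt;

int16_t rcvPing[BUFFSIZE], rcvPong[BUFFSIZE];
int16_t xmtPing[BUFFSIZE], xmtPong[BUFFSIZE];
bool    pingPong = true;        /* true = PING, false = PONG */

/* Placeholder algorithm: copy the block just received to its transmit buffer. */
void FIR_process_sketch(const int16_t *rcv, int16_t *xmt)
{
    memcpy(xmt, rcv, BUFFSIZE * sizeof(int16_t));
}

/* Call once per simulated McASP interrupt. */
void isrAudio_sketch(void)
{
    static int16_t *pRcv, *pXmt;
    static int blkCnt = 0;

    if (blkCnt == 0) {                      /* new block: pick ping or pong */
        pRcv = pingPong ? rcvPing : rcvPong;
        pXmt = pingPong ? xmtPing : xmtPong;
    }
    if (sim_rrdy) {                         /* receive side raised the interrupt */
        pRcv[blkCnt] = (int16_t)sim_xbuf_rcv;
        sim_rrdy = false;
    }
    if (sim_xrdy) {                         /* transmit side raised the interrupt */
        sim_xbuf_xmt = (uint16_t)pXmt[blkCnt];
        sim_xrdy = false;
        blkCnt++;                           /* this sample slot is done after the write */
    }
    if (blkCnt >= BUFFSIZE) {               /* end of block */
        blkCnt = 0;
        FIR_process_sketch(pRcv, pXmt);     /* pass-through copy, RCV to XMT */
        pingPong = !pingPong;               /* switch to the other buffer pair */
    }
}
```

To exercise it, set sim_xbuf_rcv and sim_rrdy and call the function, then set sim_xrdy and call it again, once per sample. After two full blocks both buffer pairs hold the received data and pingPong is back at PING.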
10. Open fir.c for editing. This is currently a placeholder for a future FIR algorithm to filter our audio. For now, we simply pass the data through from RCV to XMT. In future labs, a FIR filter written in C will magically appear and we'll analyze its performance quite extensively.
11. Open main.h for editing. main.h is actually a workhorse. It contains all of the #includes for BSL and other items, #defines for BUFFSIZE and PING/PONG, prototypes for all functions and externs for all variables that require them. Whenever you are asked to change BUFFSIZE, this is the file to change it in.
When the following dialogue appears:
- Give your platform a name: evmc6748_student (the author used _TTO for his)
- Point the repository to the path shown (this is where the platform package is stored)
- Then select the Device Family/Name as shown
- Check the box "Add Repository to Project Package Path" (so we can find it later). When you check this box, select your current project in the listing that pops up. This also adds this repository to the list of Repositories in the Properties -> General -> RTSC tab dialogue.
Click Next.
When the new platform dialogue appears, click the IMPORT button to copy the seed file we used before:
This will copy all of the initial default settings for the board, which we can then modify. A dialogue box should pop up. Select the proper seed file as shown (select the _TTO version of the platform file that the author already created for you):
Modify the memory settings to allocate all code, data and stacks into internal memory (IRAM) as shown. They may already be SET this way, so just double check. BEFORE YOU SAVE, HAVE THE INSTRUCTOR CHECK THIS FILE. Then save the new platform. This will build a new platform package.
13. Tell the tools to use this new custom platform in your project. We have created a new platform file, but we have not yet ATTACHED it to our project. When the project was created, we were asked to specify a platform file and we chose the default seed platform. How do we get back to the configuration screen? Right-click on the project and select Properties -> General and then select the RTSC tab. Look near the bottom and you'll see that the default seed platform is still specified. We need to change this. Click on the down arrow next to the Platform File. The tools should access your new repository with your new custom platform file: evmc6748_student.
Select YOUR STUDENT PLATFORM FILE and click Ok. Now, your project is using the new custom platform. Very nice.
Make sure "Enabled at startup" is NOT checked (this sets the IER bit on the C6748). This will provide us with something to debug later. Once again, you can click on the new HWI and see the corresponding Source script code.
17. McASP interrupt firing: IFR bit set? The McASP interrupt is set to fire properly, but is it setting the IFR bit? You configured HWI_INT5, so that would be a 1 in bit 5 of the IFR. Go there now (View -> Registers -> Core Registers). Look down the list to find the IFR and IER, the two of most interest at the moment. (Author note: could it have been set, then auto-cleared already?) You can also DISABLE the IER bit (as it is already in the CFG file), build/run, and THEN look at IFR (this is a nice trick). Write your debug checkmarks here: IFR bit set?
Yes
No
18. Is the IER bit set? Interrupts must be individually enabled. When you look at IER bit 5, is it set to 1? Probably NOT, because we didn't check that "Enable at Start" checkbox. Open up the config for HWI_INT5 and check the proper checkbox. Then, hit build and your code will build and load automatically, regardless of which perspective you are in. IER bit set?
19. Is GIE set? The Global Interrupt Enable (GIE) bit is located in the CPU's CSR register. SYS/BIOS turns this on automatically and then manages it as part of the O/S. So, no need to check on this. GIE bit set? Hint:
Yes
No
Do you hear audio now? You probably should. But let's check one more thing...
Yes
No
If you create a project that does NOT use SYS/BIOS, it is the responsibility of the user to turn on not only GIE in the CSR register, but also NMIE in the IER register. Otherwise, NO interrupts will be recognized. Ever. Did I say ever?
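The debug checklist (IFR bit, IER bit, GIE, NMIE) boils down to a handful of bit tests. The sketch below works on register values you read off the Registers view and pass in by hand. It assumes GIE is bit 0 of CSR and NMIE is bit 1 of IER, and the function names are hypothetical, not part of SYS/BIOS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GIE_BIT   0u   /* assumed: CSR bit 0, global interrupt enable */
#define NMIE_BIT  1u   /* assumed: IER bit 1, non-maskable interrupt enable */

/* True if the given bit of a 32-bit register snapshot is 1. */
bool bit_is_set(uint32_t reg, unsigned bit)
{
    assert(bit < 32u);
    return ((reg >> bit) & 1u) != 0u;
}

/* True only if every condition on the checklist holds for interrupt intNum. */
bool hwi_will_be_serviced(uint32_t ifr, uint32_t ier, uint32_t csr, unsigned intNum)
{
    return bit_is_set(ifr, intNum)      /* the interrupt is pending       */
        && bit_is_set(ier, intNum)      /* it is individually enabled     */
        && bit_is_set(ier, NMIE_BIT)    /* NMIE is on                     */
        && bit_is_set(csr, GIE_BIT);    /* interrupts are globally enabled */
}
```

For HWI_INT5, an IFR of 0x20 with an IER of 0x22 and GIE set passes the check. Clearing IER bit 5 (IER = 0x02) reproduces the situation this lab sets up on purpose.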
So, try this now. Run your code and halt (pause). Run again. Do you hear audio? Nope. Click the restart button and run again. Now it should work. These will be handy tips for all lab steps now and in the future.
RAISE YOUR HAND and get the instructor's attention when you have completed PART A of this lab. If time permits, you can quickly do the next optional part.
In the bottom right-hand part of the screen, you should see a little CLK symbol that looks like this:
Run to the first breakpoint, then double-click on the clock symbol to zero it. Run again and the number of CPU cycles will display.
Additional Information
Notes
Objectives
Module Topics
C64x+/C674x+ CPU Architecture
Module Topics
What Does A DSP Do?
CPU - From the Inside Out
Instruction Sets
    "MAC" Instructions
    C66x - "MAC" Instructions
Hardware Pipeline
Software Pipelining
Chapter Quiz
    Quiz - Answers
[Figure: DSP feeding a DAC; the DSP computes a sum of products]
$Y = \sum_{i=1}^{count} coeff_i \cdot x_i$
- C6x Compiler excels at Natural C
- Multiplier (.M) and ALU (.L) provide up to 8 MACs/cycle (8x8 or 16x16)
- Specialized instructions accelerate intensive, non-MAC oriented calculations. Examples include: Video compression, Machine Vision, Reed Solomon, ...
- While MMACs speed math intensive algorithms, flexibility of 8 independent functional units allows the compiler to quickly perform other types of processing
- C6x CPU can dispatch up to eight parallel instructions each cycle
- All C6x instructions are conditional, allowing efficient hardware pipelining
[Figure: C6x CPU block diagram with functional units .S1, .S2, .M1, .M2 (MACs), .L1, .L2, register files up to A31 and B31, and the Controller/Decoder]
The C6000: Designed to handle DSP's math-intensive calculations
$y = \sum_{n=1} c_n \cdot x_n$
[Figure: Mult (.M) unit with operands c, x, prod and ALU (.L) unit with operands y, prod, y, sharing a register file of 16 or 32 registers]
Making Loops
1. Program flow: the branch instruction
    B loop
$y = \sum_{n=1}^{40} c_n \cdot x_n$
[Figure: .S unit added to the data path; the loop count is loaded with MVK 40, cnt; register file of 16 or 32 registers]
[condition] B loop
Execution based on [zero/non-zero] value of specified variable

Code Syntax    Execute if:
[ cnt ]        cnt != 0
[ !cnt ]       cnt = 0

$y = \sum_{n=1}^{40} c_n \cdot x_n$
[Figure: .S unit executing MVK 40, cnt; register file of 16 or 32 registers, each 32 bits wide]
How are the c and x array values brought in from memory?
$y = \sum_{n=1}^{40} c_n \cdot x_n$
[Figure: the .D unit loads c and x from memory through the pointers *cp and *xp into Register File A (16 or 32 registers), which also holds cnt, prod, y and *yp; units used by the loop: .S .D .D .M .L .L .S]

Instr.   Description        C Type   Size
LDB      load byte          char     8-bits
LDH      load half-word     short    16-bits
LDW      load word          int      32-bits
LDDW*    load double-word   double   64-bits

* Except C62x & C67x generations
Auto-Increment of Pointers
[Figure: Register File A holding c, x, cnt, prod, y, *cp, *xp, *yp, served by the .S, .M, .L and .D units]
Data Memory: x(40), a(40), y

loop:   LDH  .D   *cp++, c
        LDH  .D   *xp++, x
        MPY  .M   c, x, prod
        ADD  .L   y, prod, y
        SUB  .L   cnt, 1, cnt
 [cnt]  B    .S   loop
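For readers more at home in C, the assembly loop above maps almost line for line onto the following sketch. The function name dot_product_sketch() is made up for illustration; the variable names mirror the symbols on the slide, and each C statement is annotated with the instruction it corresponds to.

```c
#include <assert.h>
#include <stddef.h>

/* C rendering of the slide's loop: y = sum of c[n] * x[n] over cnt terms. */
int dot_product_sketch(const short *cp, const short *xp, int cnt)
{
    int y = 0;

    assert(cp != NULL && xp != NULL && cnt > 0);
    do {
        short c = *cp++;        /* LDH *cp++, c      (load, then auto-increment) */
        short x = *xp++;        /* LDH *xp++, x                                  */
        int prod = c * x;       /* MPY c, x, prod                                */
        y += prod;              /* ADD y, prod, y                                */
        cnt--;                  /* SUB cnt, 1, cnt                               */
    } while (cnt != 0);         /* [cnt] B loop      (branch while cnt is non-zero) */
    return y;
}
```

The do/while form is deliberate: like the assembly, the body runs once before the loop counter is tested, so the count must be at least 1.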
[Figure: a second data path. Units .S1, .M1, .L1, .D1 are joined by .S2, .M2, .L2, .D2 with Register File B (32-bit registers B0 through B15 or B31)]
$y = \sum_{n=1}^{40} c_n \cdot x_n$

        MVK  .S1  40, A2
loop:   LDH       *A5++, A0
        LDH       *A6++, A1
        MPY       A0, A1, A3
        ADD       A4, A3, A4
        SUB       A2, 1, A2
  [A2]  B         loop
        STW       A4, *A7
It's easier to use symbols rather than register names, but you can use either method.
Instruction Sets
C62x RISC-like instruction set
.S Unit: ADD ADDK ADD2 AND B CLR EXT MV MVC MVK MVKH NEG NOT OR SET SHL SHR SSHL SUB SUB2 XOR ZERO
.L Unit
.M Unit: MPY MPYH MPYLH MPYHL SMPY SMPYH
.D Unit: ADD ADDAB (B/H/W) LDB (B/H/W) MV NEG STB (B/H/W) SUB SUBAB (B/H/W) ZERO
No Unit Used: NOP IDLE
.S Unit: ABSSP ABSDP CMPGTSP CMPEQSP CMPLTSP CMPGTDP CMPEQDP CMPLTDP RCPSP RCPDP RSQRSP RSQRDP SPDP
.L Unit: ADDSP ADDDP SUBSP SUBDP INTSP INTDP SPINT DPINT SPTRUNC DPTRUNC DPSP
.M Unit: MPYSP MPYDP MPYI MPYID MPY MPYH MPYLH MPYHL SMPY SMPYH
.D Unit: ADD ADDAB (B/H/W) LDB (B/H/W) LDDW MV NEG STB (B/H/W) SUB SUBAB (B/H/W) ZERO
No Unit Required: NOP IDLE
.L
Dual/Quad Arith ABS2 ADD2 ADD4 MAX MIN SUB2 SUB4 SUBABS4 Bitwise Logical ANDN Shift & Merge SHLMB SHRMB
Data Pack/Un PACK2 PACKH2 PACKLH2 PACKHL2 PACKH4 PACKL4 UNPKHU4 UNPKLU4 SWAP2/4
.D
.M
Average AVG2 AVG4 Shifts ROTL SSHVL SSHVR
Multiplies MPYHI MPYLI MPYHIR MPYLIR Load Constant MPY2 MVK (5-bit) SMPY2 Bit Operations DOTP2 DOTPN2 BITC4 DOTPRSU2 BITR DOTPNRSU2 DEAL DOTPU4 SHFL DOTPSU4 Move GMPY4 MVD XPND2/4
C64x+ Additions
.S
CALLP DMV RPACK2
None
DINT RINT SPKERNEL SPKERNELR SPLOOP SPLOOPD SPLOOPW SPMASK SPMASKR SWE SWENR
.L
.D
None
.M
CMPY CMPYR CMPYR1 DDOTP4 DDOTPH2 DDOTPH2R DDOTPL2 DDOTPL2R GMPY MPY2IR MPY32 (32-bit result) MPY32 (64-bit result) MPY32SU MPY32U MPY32US SMPY32 XORMPY
MAC Instructions
DOTP2 with LDDW
[Figure: LDDW loads a3 a2 a1 a0 into A1:A0 and x3 x2 x1 x0 into B1:B0. One DOTP2 produces a3*x3 + a2*x2 (B2) and the other produces a1*x1 + a0*x0 (A2). The results are accumulated into intermediate sums and then added to form the final sum; registers shown: B2, A2, B3, A3, A5, A4]
- Four 16x16 multiplies in each .M unit every cycle adds up to 8 MACs/cycle, or 8000 MMACS
- Bottom Line: Two loop iterations for the price of one
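What a packed dot product computes can be modelled in portable C. The sketch below is a behavioural model only, not TI code: dotp2_model(), pack2() and dot4_model() are hypothetical names, and on the TI compiler the hardware instruction is normally reached through an intrinsic instead (to my knowledge, _dotp2()). It treats each 32-bit word as two signed 16-bit halves, as on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (hi in the upper half). */
uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Model of a packed dot product: lo(a)*lo(x) + hi(a)*hi(x). */
int32_t dotp2_model(uint32_t a, uint32_t x)
{
    int16_t a_lo = (int16_t)(a & 0xFFFFu), a_hi = (int16_t)(a >> 16);
    int16_t x_lo = (int16_t)(x & 0xFFFFu), x_hi = (int16_t)(x >> 16);
    return (int32_t)a_lo * x_lo + (int32_t)a_hi * x_hi;
}

/* Four taps per pass: two packed dot products, then one add for the final sum. */
int32_t dot4_model(const int16_t a[4], const int16_t x[4])
{
    assert(a != NULL && x != NULL);
    return dotp2_model(pack2(a[3], a[2]), pack2(x[3], x[2]))   /* a3*x3 + a2*x2 */
         + dotp2_model(pack2(a[1], a[0]), pack2(x[1], x[0]));  /* a1*x1 + a0*x0 */
}
```

With a = {1, 2, 3, 4} and x = {5, 6, 7, 8}, the two partial results are 53 and 17, and their sum matches the plain four-term dot product.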
[Figure: CMPY in a single .M unit]
- Four 16x16 multiplies per .M unit
- Using two CMPYs, a total of eight 16x16 multiplies per cycle
- Floating-point version (CMPYSP) uses:
  - 64-bit inputs (register pair)
  - 128-bit packed products (register quad)
  - You then need to add/subtract the products to get the final result
[Figure: single .M unit multiplying 32-bit operand pairs (..., c1*x1)]
- Four 32x32 multiplies per .M unit
- Total of eight 32x32 multiplies per cycle
- Fixed or floating-point versions
- Output is 128-bit packed result (register quad)
[Figure: single .M unit complex matrix multiply. src1 holds (r1, i1) and (r2, i2); src2 holds src2_3 through src2_0 as (ra, ia), (rb, ib), (rc, ic), (rd, id); all fields are 32 bits; the result goes to dest]
- Single .M unit implements complex matrix multiply using 16 MACs (all in 1 cycle)
- Achieve 32 16x16 multiplies per cycle using both .M units
Hardware Pipeline
Pipeline Phases
- Program Fetch: PG PS PW PR
- Decode: DP DC
- Execute: E1
[Figure: pipeline staircase diagram. Successive fetch packets enter one phase apart (PG, PS, PW, PR, DP, DC, E1) until the pipeline is full]
Software Pipelining
Instruction Delays
All 'C64x instructions require only one cycle to execute, but some results are delayed ...

Description    Instr.                   Delay
Single Cycle   All instrs except ...    0
Multiply       MPY, SMPY                1
Load           LDB, LDH, LDW            4
Branch         B                        5
$y = \sum_{n=1}^{40} c_n \cdot x_n$

        MVK  .S1  40, A2
loop:   LDH       *A5++, A0
        LDH       *A6++, A1
        MPY       A0, A1, A3
        ADD       A4, A3, A4
        SUB       A2, 1, A2
  [A2]  B         loop
        STW       A4, *A7

Need to add NOPs to get this code to work properly. NOP = "Not Optimized Properly".
How many instructions can this CPU execute every cycle?
[Figure: software-pipelined LOOP schedule showing ldw, mpy, add, sub and B instructions from successive iterations overlapped in the same cycles]

c1:              ldw .D1
        ||       ldw .D2
        || [B0]  sub .S2
c2_3_4:          ldw .D1
        ||       ldw .D2
        || [B0]  sub .S2
        || [B0]  B   .S1
        . . .
Chapter Quiz
1. Name the four functional units and types of instructions they execute:
2. How many 16x16 MACs can a C674x CPU perform in 1 cycle? C66x?
3. Where are CPU operands stored and how do they get there?
4. What is the purpose of a hardware pipeline?
5. What is the purpose of s/w pipelining, which tool does this for you?
Quiz - Answers
1. Name the four functional units and types of instructions they execute:
- M unit: Multiplies (fixed, float)
- L unit: ALU, arithmetic and logical operations
- S unit: Branches and shifts
- D unit: Data loads and stores
2. How many 16x16 MACs can a C674x CPU perform in 1 cycle? C66x?
C674x: 8 MACs/cycle, C66x: 32 MACs/cycle
3. Where are CPU operands stored and how do they get there?
Register Files (A and B); Load (LDx) data from memory
5. What is the purpose of s/w pipelining, which tool does this for you?
Maximize performance: use as many functional units as possible in every cycle. The COMPILER/OPTIMIZER performs SW pipelining.
Outline
Objectives
- Describe how to configure and use the various compiler/optimizer options
- Discuss the key techniques to increase performance or reduce code size
- Demonstrate how to use optimized libraries
- Overview key system optimizations
- Lab 13: Use FIR algo on audio data, optimize using the compiler, benchmark
Module Topics
C and System Optimizations
Module Topics
Introduction - "Optimal" and "Optimization"
C Compiler and Optimizer
    "Debug" vs. "Optimized"
    Levels of Optimization
    Build Configurations
    Code Space Optimization (-ms)
    File and Function Specific Options
    Coding Guidelines
Data Types and Alignment
    Data Types
    Data Alignment
    Using DATA_ALIGN
    Upcoming Changes - ELF vs. COFF
Restricting Memory Dependencies (Aliasing)
Access Hardware Features - Using Intrinsics
Give Compiler MORE Information
    Pragma - Unroll()
    Pragma - MUST_ITERATE()
    Keyword - Volatile
    Setting MAX interrupt Latency (-mi option)
    Compiler Directive - _nassert()
Using Optimized Libraries
    Libraries - Download and Support
System Optimizations
    BIOS Libraries
    Custom Sections
    Use Cache
    Use EDMA
    System Architecture - SCR
Chapter Quiz
    Quiz - Answers
Lab 13 - C Optimizations
Lab 13 - C Optimizations - Procedure
    PART A - Goals and Using Compiler Options
        Determine Goals and CPU Min
        Using Debug Configuration (-g, NO opt)
        Using Release Configuration (-o2, no -g)
        Using "Opt" Configuration
    Part B - Code Tuning
    Part C - Minimizing Code Size (-ms)
    Part D - Using DSPLib
    Conclusion
Additional Information
Notes
$Y = \sum_{i=1}^{count} coeff_i \cdot x_i$
Goals:
- A typical goal of any system's algo is to meet real-time
- You might also want to approach or achieve "CPU Min" in order to maximize #channels processed
- CPU Min: the minimum # cycles the algo takes based on architectural limits (e.g. data size, #loads, math operations required)
- Often, meeting real-time only requires setting a few compiler options (easy)
- However, achieving "CPU Min" often requires extensive knowledge of the architecture (harder, requires more time)
FIR Dot Product

    for (i = 0; i < count; i++) {
        Y += coeff[i] * x[i];
    }

Benchmarks (table columns): Algo, Debug (no opt, -g), Opt (-o3, no -g), Add'l pragmas, (DSPLib), CPU Min
- Debug: get your code LOGICALLY correct first (no optimization)
- Opt: increase performance using compiler options (easier)
- CPU Min: it depends. Could require extensive time

Debug:
- Provides the best debug environment with full symbolic support, no code motion, easy to single step
- Code is NOT optimized, i.e. very poor performance
- Create test vectors on FUNCTION boundaries (use same vectors as Opt Env)

Opt:
- Higher levels of opt result in code motion; functions become black boxes (hence the use of FXN vectors)
- Optimizer can find errors in your code (use volatile)
- Highly optimized code (can reach CPU Min w/some algos)
- Each level of optimization increases the optimizer's scope
Levels of Optimization
Option      Scope      Optimizes
-o0, -o1    LOCAL      single block
-o2         FUNCTION   across blocks
-o3         FILE       across functions
-pm -o3     PROGRAM    across files

[Figure: FILE1.C and FILE2.C with nested { } blocks showing each scope]
Increasing levels of opt: more scope and code motion, longer build times, less visibility
Build Configurations
Coding Guidelines
Programming the C6000

Source       Tool                  Efficiency*   Effort
C, C++       Compiler Optimizer    80 - 100%     Low
Linear ASM   Assembly Optimizer    95 - 100%     Med
ASM          Hand Optimize         100%          High
Data Alignment
Using DATA_ALIGN
Starting with v7.2.0 the C6000 Code Gen Tools (CGT) will begin shipping two versions of the Linker:
1. COFF: Binary file-format used by TI tools for over a decade
2. ELF: New binary file-format which provides additional features like dynamic/relocatable linking

- v7.3.x default may become ELF (prior to this, choose ELF for new features)
- Continue using COFF for projects already in progress using the --abi=coffabi compiler option (support will continue for a long time)

Migration Issues
- Your program's binary files (.obj, .lib) must all be built with the same format
- If building libraries used for multiple projects, we recommend building two libraries, one with each format
- EABI longs are 32 bits; new TI type (__int40_t) created to support 40-bit data
- COFF adds a leading underscore to symbol names, but the EABI does not
See: http://processors.wiki.ti.com/index.php/C6000_EABI_Migration
Aliasing?
What happens if the function is called like this?

    fcn(*myVector, *myVector+1)

[Figure: memory contents a b c d e ..., with pointers in and in + 4]

    void fcn(*in, *out)
    {
        LDW *in++, A0
        ADD A0, 4, A1
        STW A1, *out++
    }

- Definitely aliased pointers
- *in and *out could point to the same address
- But how does the compiler know?
- If you tell the compiler there is no aliasing, this code will break (LDs in software pipelined loop)
- One solution is to "restrict" the writes - *out (see next slide)
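A minimal C99 sketch of the idea, assuming the slide's function adds 4 to each input word: the restrict qualifier is the programmer's promise that in and out never overlap, which frees the compiler to reorder loads and stores. The helper buffers_overlap() is hypothetical, added only to show how a caller could check the promise before making it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* restrict promises the compiler that in[] and out[] never overlap. */
void add4_restrict(const int *restrict in, int *restrict out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] + 4;     /* LDW, ADD 4, STW from the slide */
}

/* Hypothetical helper: 1 if the two n-element buffers share any memory. */
int buffers_overlap(const int *in, const int *out, size_t n)
{
    uintptr_t a = (uintptr_t)in, b = (uintptr_t)out;
    uintptr_t len = (uintptr_t)(n * sizeof(int));
    return (a < b + len) && (b < a + len);
}
```

Calling add4_restrict(myVector, myVector + 1, n) would break the promise, since each store lands on an element that a later load still needs. With the promise broken, the result depends on how the compiler scheduled the loop.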
Pragma Unroll()
Pragma MUST_ITERATE()
4. MUST_ITERATE(min, max, %factor)
#pragma UNROLL(2);
Gives the compiler information about the trip (loop) count. In the code above, we are promising that:
count >= 10, count <= 100, and count % 2 == 0
- If you break your promise, you might break your code
- MIN helps with code size and software pipelining
- MULT allows for efficient loop unrolling (and "odd" cases)
- The #pragma must come right before the for() loop
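As a hedged illustration of the promise described above (count between 10 and 100, and a multiple of 2), here is a small sketch. The function name sum_products() is made up. The pragma is wrapped in a check for __TI_COMPILER_VERSION__ so that other compilers simply skip it, and the assert() makes a broken promise visible in a debug build.

```c
#include <assert.h>

/* Sum of products with a trip-count promise for the optimizer. */
int sum_products(const short *a, const short *b, int count)
{
    int i, sum = 0;

    /* The promise must actually hold, or the optimized loop may misbehave. */
    assert(count >= 10 && count <= 100 && count % 2 == 0);

#ifdef __TI_COMPILER_VERSION__          /* only the TI compiler acts on this pragma */
    #pragma MUST_ITERATE(10, 100, 2)
#endif
    for (i = 0; i < count; i++)         /* the pragma sits right before the for() loop */
        sum += a[i] * b[i];
    return sum;
}
```

The three pragma arguments are the minimum trip count, the maximum trip count and the multiple, in that order.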
Keyword - Volatile
-mi Details
-mi0:
- Compiler's code is not interruptible
- User must guarantee no interrupts will occur
-mi1:
- Compiler uses single assignment and never produces a loop less than 6 cycles
-mi1000:
- Tells the compiler your system must be able to see interrupts every 1000 cycles
- Compiler will software pipeline (when using -o2 or -o3)
- Interrupts are disabled for s/w pipelined loops
Notes:
- Be aware that the compiler is unaware of issues such as memory wait-states, etc.
- Using -mi, the compiler only counts instruction cycles
MUST_ITERATE Example
    int dot_prod(short *a, short *b, int n)
    {
        int i, sum = 0;
        #pragma MUST_ITERATE ( ,512)
        for (i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }
Provided:
- If interrupt threshold was set at 1000 cycles (-mi 1000),
- Assuming this can compile as a single-cycle loop,
- And 512 = max# for Loop count (per MUST_ITERATE pragma).
Result:
- The compiler knows a 1-cycle kernel will execute no more than 512 times, which is less than the 1000 cycle interrupt disable option (-mi1000)
- Uninterruptible loop works fine
Verdict:
- 3072 cycle loop (512 x 6) can become a 512 cycle loop
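The arithmetic behind the verdict can be written as a one-line check. This is a simplified model, not the compiler's actual rule: it ignores prolog/epilog cycles and memory stalls, and loop_fits_threshold() is a hypothetical name.

```c
#include <assert.h>

/* 1 if a loop that runs with interrupts disabled finishes within the -mi threshold. */
int loop_fits_threshold(int max_trip_count, int cycles_per_iteration, int mi_threshold)
{
    assert(max_trip_count > 0 && cycles_per_iteration > 0 && mi_threshold > 0);
    return max_trip_count * cycles_per_iteration <= mi_threshold;
}
```

With the slide's numbers, 512 iterations of a 1-cycle kernel (512 cycles) fit under the 1000-cycle threshold, while 512 iterations of a 6-cycle loop (3072 cycles) would not.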
System Optimizations
Custom Sections
Use Cache
Use EDMA
Using EDMA
[Figure: CPU, Internal RAM (0x8000), EDMA, EMIF and Master Periph blocks]
- Program the EDMA to automatically transfer data/code from one location to another
- Operation is performed WITHOUT CPU intervention
- All details covered in a later chapter

DMA: Enhanced DMA (version 3)
- DMA to/from peripherals
- Can be sync'd to peripheral events
- Handles up to 64 events

QDMA: Quick DMA
- DMA between memory
- Async: must be started by CPU
- 4-8 channels available

Both share (number depends upon specific device):
- 128-256 Parameter RAM sets (PARAMs)
- 64 transfer complete flags
- 2-4 pending transfer queues
SCR: Switched Central Resource
- Masters initiate accesses to/from slaves via the SCR
- Most Masters (requestors) and Slaves (resources) have their own port to SCR
- Lower bandwidth masters (HPI, PCIe, etc) share a port
- There is a default priority (0 to 7) to SCR resources that can be modified:
  - SRIO, HOST (PCI/HPI), EMAC
  - TC0, TC1, TC2, TC3
  - CPU accesses (cache misses)
  - Priority Register: MSTPRI
[Figure: the SCR connects Masters (SRIO, CPU, CC with TC0-TC3, PCIe, HPI, EMAC) to Slaves (C64 Mem, DDR2, EMIF64, TCP, VCP)]
Chapter Quiz
1. How do you turn ON the optimizer?
2. Why is there such a performance delta between "Debug" and "Opt"?
3. Name 4 compiler techniques to increase performance besides -o?
4. Why is data alignment important?
5. What is the purpose of the -mi option?
6. What is the BEST feedback mechanism to test the compiler's efficiency?
Quiz - Answers
1. How do you turn ON the optimizer?
Project -> Properties, use -o2 or -o3 for best performance
Lab 13 C Optimizations
In the following lab, you will gain some experience benchmarking the use of optimizations using the C optimizer switches. While your own mileage may vary greatly, you will gain an understanding of how the optimizer works, where the switches are located and their possible effects on speed and size.
So, the CPU Min = 16384/8 = ~2048 cycles + overhead. If you look at the inner loop (which is a simple dot product), it will take 64/8 cycles = 8 cycles per inner loop. Add 8 cycles of overhead for prologue and epilogue (pre-loop and post-loop code), so the inner loop is 16 cycles. Multiply that by the buffer size = 256, so the approximate CPU Min = 16*256 = 4096 cycles.
3. Import Lab 13 Project. Import the Lab 13 project from the \Labs\Lab13 folder. Change the build properties to use YOUR student platform file and ensure the latest BIOS/XDC/UIA tools are selected.
4. Analyze new items: FIR_process and COEFFs. Open fir.c. You will notice that this file is quite different. It has the same overall TSK structure (Semaphore_pend, if ping/pong, etc). Notice that after the if(pingPong), we process the data using a FIR filter. Scroll on down to cfir(). This is a simple nested for() loop. The outer loop runs once for every sample in the block (in our case, DATA_SIZE times). The inner loop runs the size of COEFFS[] times (in our case, 64). Open coeffs.c. Here you will see the coefficients for the symmetric FIR filter. There are 3 sets: low-pass, hi-pass and all-pass. We'll use the low-pass for now.
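The nested loop and the CPU Min arithmetic can be tied together with a short sketch. This is not the lab's cfir(): the signature, the cfir_sketch() name and the unscaled int output are assumptions made for illustration. It returns the number of multiply-accumulates so the 16384 figure can be checked directly.

```c
#include <assert.h>
#include <stddef.h>

#define DATA_SIZE  256   /* samples per block, as in the lab text */
#define NUM_COEFFS 64    /* filter length, as in the lab text     */

/* x must hold dataSize + numCoeffs - 1 samples. Returns the number of MACs performed. */
long cfir_sketch(const short *x, const short *coeffs, int *y, int dataSize, int numCoeffs)
{
    long macs = 0;

    assert(x != NULL && coeffs != NULL && y != NULL);
    for (int i = 0; i < dataSize; i++) {          /* outer loop: once per output sample */
        int sum = 0;
        for (int j = 0; j < numCoeffs; j++) {     /* inner loop: a simple dot product   */
            sum += coeffs[j] * x[i + j];
            macs++;
        }
        y[i] = sum;
    }
    return macs;
}
```

For a 256-sample block and 64 coefficients the function performs 256 * 64 = 16384 MACs. Dividing by 8 MACs per cycle gives the ~2048-cycle figure quoted above, before loop overhead is added.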
About 913K cycles. Whoa. Maybe we need to OPTIMIZE this thing. What were your results? Write them down below:
Debug (-g, no opt) benchmark for cfir()? _________________ cycles
Did we meet our real-time goal (music sounding fine?): ____________
Can anyone say "heck no"? The audio sounds terrible. We have failed to meet our only real-time goal. But hey, it's using the Debug configuration. And if we want to single step our code, we can. It is a very nice debug-friendly environment, although the performance is abysmal. This is to be expected.
8. Check Semaphore count of mcaspReadySem. If the semaphore count for mcaspReadySem is anything other than ZERO after the Semaphore_pend in FIR_process(), we have troubles. This indicates that we are NOT keeping up with real time. In other words, the Hwi is posting the semaphore but the processing algorithm is NOT keeping up with these posts. Therefore, if the count is higher than 0, we are NOT meeting real time. Use ROV and look at the Semaphore module. Your results may vary, but you'll see the semaphore counts pretty high (darn, even ledToggleSem is out of control):
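Why a nonzero count means missed deadlines can be seen in a small host-side model. This is a simplified sketch, not lab code: sem_count_after_pend is a hypothetical name, posts are assumed to arrive at a fixed period, and the consumer is assumed to need a fixed number of cycles per buffer. On the target you would read the real count in ROV (SYS/BIOS also offers Semaphore_getCount() for reading it from code).

```c
/* Model: the Hwi posts once every 'period' cycles; the task pends, then
   works for 'work' cycles. Returns the semaphore count seen right after
   the n-th pend returns (0 means the task is keeping up). */
unsigned sem_count_after_pend(unsigned n, unsigned period, unsigned work)
{
    unsigned long now = 0;   /* consumer's clock, in cycles */
    unsigned count = 0;

    for (unsigned k = 1; k <= n; k++) {
        unsigned long ready = (unsigned long)k * period;  /* time of k-th post */
        if (now < ready)
            now = ready;                      /* pend blocks until the post */
        count = (unsigned)(now / period) - k; /* posts so far minus pends taken */
        now += work;                          /* process one buffer */
    }
    return count;
}
```

When work is shorter than period the count stays at zero; when it is longer, the backlog grows with every buffer, which is the situation the Debug build is in.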
My goodness, a number WELL greater than zero. We are definitely not meeting real time.
9. View Debug compiler options. FYI, if you looked at the options for the Debug configuration, you'd see the following:
Full symbolic debug is turned on and NO optimizations. Ok, a nice fluffy debug environment to make sure we're getting the right answers, but not good enough to meet real time. Let's kick it up a notch.
Check Properties Include directory. Make sure the BSL \inc folder is specified. Also, double-check your PLATFORM file. Make sure all code/data/stacks are in internal memory and that your project is USING the proper platform in this NEW build configuration. Once again, these configurations are containers of options. Even though Debug had the proper platform file specified, Release might NOT!
11. Rebuild and Play. Build and Run. If you get errors, did you remember to set the INCLUDE path for the BSL library? Remember, the Debug configuration is a container of options, including your path statements and platform file. So, if you switch configs (Debug to Release), you must also add ALL path statements and other options you want. Don't forget to modify the RTSC settings to point to your _student platform AGAIN! Once built and loaded, your audio should sound fine now (that is, if you like to hear music with no treble).
12. Benchmark cfir() in release mode. Using the same method as before, observe the benchmark for cfir(). Release (-o2, no -g) benchmark for cfir()? __________ cycles
Meet real-time goal? Music sound better? ____________ Here's our picture:
Ok, now we're talkin'. It went from 913K to 37K just by switching to the release configuration. So, the bottom line is TURN ON THE OPTIMIZER!!
13. Study release configuration build properties. Here's a picture of the build options for release:
Click New:
(Also note the Remove button, where you can delete build configurations.) Give the new configuration a name, Opt, and choose to copy the existing configuration from Release. Click Ok.
15. Change the Opt build properties to use -o3 and NO -g (the blank choice). The only change that needs to be made is to turn UP the optimization level to -o3 vs. -o2, which was used in the Release configuration. Also, make sure -g is turned OFF (which it should already be). Open the Opt configuration's Build Properties and verify it contains NO -g (blank) and an optimization level of -o3. Rebuild your code and benchmark (FYI, the LED may stop blinking; don't worry). Follow the same procedure as before to benchmark cfir: Opt (-o3, no -g) benchmark for cfir()? __________ cycles
The author's number was about 18K cycles, another pretty significant performance increase over -o2, -g. We simply went to -o3 and killed -g and WHAM, we went from 37K to 18K. This is why the author has stated before that the Opt settings we used in this lab SHOULD be the RELEASE settings. But I am not king. So, as you can see, we went from 913K to 18K in about 30 minutes. Wow. But what was the CPU Min? About 4K? Ok, we still have some room for improvement. Just for kicks and grins, try single stepping your code and/or adding breakpoints in the middle of a function (like cfir). Is this more difficult with -g turned OFF and -o3 applied? Yep.
Note: With -g turned OFF, you still get symbol capability, i.e. you can enter symbol names into the watch and memory windows. However, it is nearly impossible to single step C code, hence the suggestion to create test vectors at function boundaries to check the LOGICAL part of your code when you build with the Debug configuration. When you turn off -g, you need to look at the answers on function boundaries to make sure it is working properly.
16. Turn on verbose and interlist and then see what the .asm file looks like for fir.asm. As noted in the discussion material, to see it all, you need to turn on three switches. Turn them on now, then build, then peruse the fir.asm file. You will see some interesting information about software pipelining for the loops in fir.c. Turn on: Runtime Model Options -> Verbose pipeline info (-mw)
The author's results were close to the previous results, about 15K. Well, this code tuning didn't help THIS algo much, but it might help yours. At least you know how to apply it now.
18. Use restrict keyword on the results array. You actually have a few options to tell the compiler there is NO ALIASING. The first method is to tell the compiler that your entire project contains no aliasing (using the -mt compiler option). However, it is best to narrow the scope and simply tell the compiler that the results array has no aliasing (because the WRITES are destructive, we RESTRICT the output array). So, in fir.c, add the following keyword (restrict) to the results (r) parameter of the fir algorithm as shown:
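The screenshot of the edited prototype is not reproduced here, so the following is a hedged sketch of what restricting the results pointer looks like. The function name cfir_restrict, the argument order, the Q15 shift and the MUST_ITERATE bounds are assumptions for illustration; use the actual prototype and loop counts from your fir.c.

```c
#include <stdint.h>

/* Sketch only: 'restrict' on r promises that writes through r never
   alias the input or coefficient arrays. */
void cfir_restrict(const int16_t *x, const int16_t *h,
                   int16_t * restrict r, int nx, int nh)
{
    for (int i = 0; i < nx; i++) {
        int32_t sum = 0;
#ifdef __TI_COMPILER_VERSION__
        /* Assumed contract for this sketch: nh is 8..64 and a multiple of 8. */
        #pragma MUST_ITERATE(8, 64, 8)
#endif
        for (int j = 0; j < nh; j++)
            sum += (int32_t)x[i + j] * h[j];
        r[i] = (int16_t)(sum >> 15);   /* assumed Q15 scaling */
    }
}
```

The promise is yours to keep: if the caller ever passes an r that overlaps x or h, the behavior is undefined, which is why restricting only the one output array is a narrower claim than a project-wide -mt.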
Build, then run again. Now benchmark your code again. Did it improve? Opt + MUST_ITERATE + restrict (-o3, no -g) benchmark for cfir()? __________ cycles
Here is what the author got:
Well, getting rid of ALIASING was a big help to our algo. We went from about 15K down to 7K cycles. You could achieve the same result by using the -mt compiler switch, but that tells the compiler that there is NO aliasing ANYWHERE; the scope is huge. Restrict is more restricted.
19. Use _nassert() to tell the optimizer about data alignment. Because the receive buffers are set up using STRUCTURES, the compiler may or may not be able to determine the alignment of an ELEMENT (i.e. rcvPingL.hist) inside that structure, thus causing the optimizer to be conservative and use redundant loops. You may have seen the benchmarks have two results the same, and one larger. Or, you may not have. It usually happens on Thursdays. It is possible that using _nassert() may help this situation. Again, this fix is only needed in this specific case where the memory buffers were allocated using structures (see main.h if you want a looksy). Uncomment the two _nassert() intrinsics in fir.c inside the cfir() function, rebuild/run and check the results. Here is what the author got (same as before, but hey, worth a try):
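As a rough illustration of the idea (not the lab's actual lines), an alignment promise with _nassert() typically looks like the sketch below. The function names, the 8-byte boundary and the host fallback macro are assumptions: on a non-TI compiler the sketch maps _nassert to assert so that the promise is at least checked.

```c
#include <stdint.h>

#ifndef __TI_COMPILER_VERSION__
#include <assert.h>
/* Host stand-in (assumption): check the promise instead of only stating it. */
#define _nassert(cond) assert(cond)
#endif

int32_t dot_aligned(const int16_t *x, const int16_t *h, int n)
{
    /* Tell the optimizer both pointers sit on 8-byte boundaries. */
    _nassert(((uintptr_t)x & 7) == 0);
    _nassert(((uintptr_t)h & 7) == 0);

    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += (int32_t)x[i] * h[i];
    return sum;
}

/* Host-only demo with explicitly aligned buffers (C11 _Alignas). */
int32_t dot_aligned_demo(void)
{
    static _Alignas(8) int16_t x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    static _Alignas(8) int16_t h[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    return dot_aligned(x, h, 8);
}
```

As with restrict, the statement is a promise rather than a check on the target, so it only helps when the buffers really are aligned as claimed.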
20. Turn on symbolic debug with FULL optimization. This is an important little trick that you need to know. As we have stated before, it is impossible to single step your code when you have optimization turned on to level -o3. You are able to place breakpoints at function entry/exit points and check your answers, but that's it. This is why FUNCTION LEVEL test vectors are important. There are two ways to accomplish this. Some companies use script code to place breakpoints at specific disassembly symbols (function entry/exit) and run test vectors through automatically. Others simply want to manually set breakpoints in their source code, hit RUN and see the results. While still in the Debug perspective with your program loaded, select Restart. The execution pointer is at main, but do you see your main() source file? Probably not. Ok, pop over to the Edit perspective and open fir.c. Set a breakpoint at the beginning of the function. Hit RUN. Your program will stop at that breakpoint, but in the Debug perspective, do you see your source file associated with the disassembly? Again, probably not. Again, hit Restart to start your program at main() again. How do you tell the compiler to add JUST ENOUGH debug info to allow your source files to SYNC with the disassembly but not affect optimization? There is a little known option that allows this.
Make sure you have the Opt configuration selected, right click and choose Properties. Next, check the box below (at C6000 Compiler -> Runtime Model Options) to turn on symbolic debug with FULL optimization (-mn):
TURN ON -g (symbolic debug). -mn only makes sense if -g is turned ON. Go back to the basic options and select Full Symbolic Debug. Rebuild and load your program. The execution pointer should now show up along with your main.c file. Hit Restart again. Set a breakpoint in the middle of the FIR_process() function inside fir.c. You can't do it. The breakpoint snaps to the beginning or end of the function, right? Make sure the breakpoint is at the beginning of FIR_process() and hit RUN. You can now see your source code synced with the disassembly. Very nice. But did this affect your optimization and your benchmark? Go try it. Hit Restart again and remove all breakpoints. Then RUN. Halt your program and check your benchmark. Is it about the same? It should be.
Wow, for what we wanted in THIS system (a fast, simple FIR routine), we would have been better off just using DSPLib. Yep. But, in the process, you've learned a great deal about optimization techniques across the board that may or may not help your specific system. Remember, your mileage may vary.
Conclusion
Hopefully this exercise gave you a feel for how to use some of the basic compiler/optimizer switches for your own application. Everyone's mileage may vary, and there just might be a magic switch that helps your code and doesn't help someone else's. That's the beauty of trial and error. Conclusion? TURN ON THE OPTIMIZER! Was that loud enough? Here's what the author came up with. How did your results compare?
Optimizations                      Benchmark
Debug Bld Config, no opt           913K
Release (-o2, -g)                  37K
Opt (-o3, no -g)                   18K
Opt + MUST_ITERATE                 15K
Opt + MUST_ITERATE + restrict      7K
DSPLib (FIR)                       7K
Regarding -ms3, use it wisely. It is more useful to add this option to functions that are large but not time critical, like IDL functions, init code and maintenance-type items. You can save some code space (important) and lose some performance (probably a don't care). For your time-critical functions, do not use -ms ANYTHING. This is just a suggestion; again, your mileage may vary.
CPU Min was 4K cycles. We got close, but didn't quite reach it. The authors believe that it is possible to get closer to the 4K benchmark by using intrinsics and the DDOTP instruction. The biggest limiting factor in optimizing the cfir routine is the sliding window. The processor is only allowed ONE non-aligned load each cycle. This would happen 75% of the time. So, the compiler is already playing some games and optimizing extremely well given the circumstances. It would require hand-tweaking via intrinsics and intimate knowledge of the architecture to achieve much better.
29. Terminate the Debug session, close the project and close CCS. Power-cycle the board.
Throw something at the instructor to let him know that you're done with the lab. Hard, sharp objects are most welcome.
Additional Information
IDMA0 operates on a block of 32 contiguous 32-bit registers (both src/dst blocks must be aligned on a 32-word boundary). Optionally generate a CPU interrupt if needed.
User provides: Src, Dst, Count and Mask (Reference: SPRU871)
Count = # of 32-register blocks to xfr (up to 16)
Mask = 32-bit mask that determines WHICH registers to transfer (0 = xfr, 1 = NO xfr)
[Figure: Src block and Periph Cfg block, registers 0..31, each 32 bits wide]
Example transfer using MASK (not all regs typically need to be programmed):
Registers transferred from source to destination: 0, 1, 4, 5, 6, 8, 10, 12, 22, 23, 27, 29, 31
Mask = 01010111001111111110101010001100
User must write to IDMA0 registers in the following order (COUNT written triggers transfer):
IDMA0_MASK = 0x573FEA8C;     // set mask for 13 regs above
IDMA0_SOURCE = reg_ptr;      // set src addr in L1D/L2
IDMA0_DEST = MMR_ADDRESS;    // set dst addr to config location
IDMA0_COUNT = 0;             // set count for 1 block of 32 registers
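The mask value can be cross-checked on a host PC. The sketch below is a plain-C model, not register-level code: idma0_mask_for, idma0_model_copy and idma0_example_mask are hypothetical helper names, and the register list is read off the zero bits of the mask shown above.

```c
#include <stdint.h>

/* Build an IDMA0-style mask: bit i = 0 means "transfer register i". */
uint32_t idma0_mask_for(const int *regs, int n)
{
    uint32_t xfer = 0;
    for (int i = 0; i < n; i++)
        xfer |= (uint32_t)1 << regs[i];
    return ~xfer;
}

/* Host model of one 32-register block transfer honoring the mask. */
void idma0_model_copy(uint32_t *dst, const uint32_t *src, uint32_t mask)
{
    for (int i = 0; i < 32; i++)
        if (((mask >> i) & 1u) == 0)
            dst[i] = src[i];
}

/* The 13 registers from the example above. */
uint32_t idma0_example_mask(void)
{
    static const int regs[13] = {0, 1, 4, 5, 6, 8, 10, 12, 22, 23, 27, 29, 31};
    return idma0_mask_for(regs, 13);
}
```

For the 13 registers in the example this yields 0x573FEA8C, the value written to IDMA0_MASK.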
Objectives
Compare/contrast different uses of memory (internal, external, cache)
Define cache terms and definitions
Describe C6000 cache architecture
Demonstrate how to configure and use cache optimally
C6000 Embedded Design Workshop Using BIOS - Cache & Internal Memory
Module Topics
Cache & Internal Memory
Module Topics
Why Cache?
Cache Basics Terminology
Cache Example
L1P Program Cache
L1D Data Cache
L2 RAM or Cache ?
Cache Coherency (or Incoherency?)
Coherency Example
Coherency Reads & Writes
Cache Functions Summary
Coherency Use Internal RAM !
Coherency Summary
Cache Alignment
Turning OFF Cacheability (MAR)
Additional Topics
Chapter Quiz
Quiz Answers
Lab 14 Using Cache
Lab Overview:
Lab 14 Using Cache Procedure
A. Run System From Internal RAM
B. Run System From External DDR2 (no cache)
C. Run System From DDR2 (cache ON)
Why Cache?
Cache Example
Scheme          Size        Linesize              New Features
Direct Mapped   4K bytes    64 bytes (16 instr)   N/A
Direct Mapped   16K bytes   32 bytes (8 instr)
Direct Mapped   32K bytes   32 bytes (8 instr)
Next two slides discuss Cache/RAM and Freeze features. Memory Protection is not discussed in this workshop.
L2 RAM or Cache ?
[Figure: CPU, XmtBuf, writeback, XmtBuf, EDMA]
When the CPU is finished with the data (and has written it to XmtBuf in L2), it can be sent to ext. memory with a cache writeback
A writeback is a copy operation from cache to memory, writing back the modified (i.e. dirty) memory locations; all writebacks operate on full cache lines
Use BIOS Cache APIs to force a writeback:
BIOS: Cache_wb (XmtBuf, BUFFSIZE, CACHE_NOWAIT);
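Because writebacks operate on full cache lines, a buffer that is not line-aligned touches more lines than its size suggests. The helper below is a host-side sketch of that arithmetic only; wb_lines_touched is a hypothetical name, and the 128-byte line size in the usage note is the L2 figure quoted in the chapter quiz.

```c
#include <stdint.h>

/* Number of cache lines a writeback of [addr, addr + bytes) touches.
   line_size must be a power of two; bytes must be > 0. */
unsigned wb_lines_touched(uintptr_t addr, unsigned bytes, unsigned line_size)
{
    uintptr_t first = addr & ~(uintptr_t)(line_size - 1);
    uintptr_t last  = (addr + bytes - 1) & ~(uintptr_t)(line_size - 1);
    return (unsigned)((last - first) / line_size) + 1;
}
```

For example, a 256-byte buffer starting on a 128-byte boundary covers 2 lines, while the same buffer starting 16 bytes later covers 3, and the extra line may hold unrelated data that gets written back along with it.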
Coherency Summary
Cache Alignment
Turning OFF Cacheability (MAR)
Memory Attribute Registers (MARs) enable/disable DATA caching for memory ranges
Don't use MAR to solve basic cache coherency; performance will be too slow
Use MAR when you have to always read the latest value of a memory location, such as a status register in an FPGA, or switches on a board
MAR is like "volatile". You must use both to always read a memory location: MAR for cache; volatile for the compiler
Looking more closely at the MAR registers ...
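The compiler half of that pairing can be sketched in portable C. This is an illustration under assumptions: wait_until_ready and its parameters are hypothetical, and on real hardware the pointer would refer to a status word in a region whose MAR bit marks it non-cacheable.

```c
#include <stdint.h>

/* Poll a status word until (status & ready_mask) is nonzero.
   'volatile' forces a fresh load on every pass through the loop. */
int wait_until_ready(volatile const uint32_t *status, uint32_t ready_mask,
                     unsigned max_polls)
{
    while (max_polls-- > 0) {
        if ((*status & ready_mask) != 0)
            return 1;   /* device reported ready */
    }
    return 0;           /* gave up after max_polls reads */
}
```

Without volatile, an optimizing compiler is free to load the value once and reuse it; without the MAR setting, the load could be satisfied from a stale cache line. Each fixes a different layer.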
Additional Topics
Chapter Quiz
1. How do you turn ON the cache?
2. Name the three types of caches & their associated memories:
3. All cache operations affect an aligned cache line. How big is a line?
4. Which bit(s) turn on/off cacheability and where do you set these?
5. How do you fix coherency when two bus masters access ext'l mem?
6. If a dirty (newly written) cache line needs to be evicted, how does that dirty line get written out to external memory?
Quiz Answers
1. How do you turn ON the cache ?
Set size > 0 in platform package (or via Cache_setSize() during runtime)
3. All cache operations affect an aligned cache line. How big is a line?
L1P: 32 bytes (256 bits), L1D: 64 bytes, L2: 128 bytes
4. Which bit(s) turn on/off cacheability and where do you set these?
MAR (Memory Attribute Register) bits; each affects 16MB of ext'l data space; set in the .cfg file
5. How do you fix coherency when two bus masters access extl mem?
Invalidate before a read, writeback after a write (or use L2 mem)
6. If a dirty (newly written) cache line needs to be evicted, how does that dirty line get written out to external memory?
Cache controller takes care of this
Lab 14 Using Cache
Lab Overview:
There are two goals in this lab: (1) to learn how to turn the cache on and off and see the effects of each on the data buffers and program code; (2) to optimize a hi-pass FIR filter written in C. To gain this basic knowledge you will:
A. Learn to use the platform and CFG files to set up the cacheable memory address range (MAR bits) and turn on the L2 and L1 caches.
B. Benchmark the system performance when running code/data externally (DDR2) vs. with the cache on vs. internally (IRAM).
2. Ensure BUFFSIZE is 256 in main.h. In order to compare our cache lab to the OPT lab, we need to make sure the buffer sizes are the same, which is 256.
3. Find out where code and data are mapped to in memory. First, check Build Properties for the Opt configuration. Make sure you are using YOUR student platform file in this configuration. Then, view the platform file and determine which memory segments (like IRAM) contain the following sections:
Section    Memory Segment
.text      ______________
.bss       ______________
.far       ______________
It's not so simple, is it? The .bss and .far sections are data and .text is code. If you didn't know that, you couldn't answer the question. So, they are all allocated in IRAM; if not, please make sure they are before moving on.
4. Which cache areas are turned on/off (circle your answer)?
L1P    OFF/ON
L1D    OFF/ON
L2     OFF/ON
Leave the settings as is.
5. Build, load. BEFORE YOU RUN, open up the Raw Logs window. Click Run and write down below the benchmarks for cfir():
7. Clean project, build, load, run using the Opt configuration. Select Project -> Clean (this will ensure your platform file is correct). Then Build and load your code. Run your code. Listen to the audio. How does it sound? It's DEAD, that's how it sounds: just air, bad air; it is the absence of noise. Plus, we can't see anything because the CPU is overloaded and therefore no RTA tools. Ah, but Log_info() just might save us again. Go look at the Raw Logs and see if the benchmark is getting reported.
Did you get a cycle count? The author experienced a total loss, absolute NOTHING. I think the system is so out of it, it crashes. In fact, CCS crashed a few times in this mode. Yikes. I vote for calling it the national debt #cycles. Uh, what is it now, $15 Trillion? Ok, 15 trillion cycles ;-)
Set L1D/P to 32K and L2 to 64K. IF YOU DON'T SET L2 CACHE ON, YOU WILL CACHE IN L1 ONLY. Watch it, though: when you reconfigure cache sizes, it wipes your memory section selections. Redo those properly after you set the cache sizes. These sizes are larger than we need, but it is good enough for now. Leave code/data in DDR and stacks in IRAM. Click Ok to rebuild the platform package. The system we now have is identical to one of the slides in the discussion material.
9. Wait, what about the MAR bits? In the discussion material, we talked about the MAR bits specifying which regions were cacheable and which were not. Don't we have to set the MAR bits for the external region of DDR for them to get cached? Yep. In order to modify (or even SEE) the MAR bits OR use any BIOS Cache APIs (like invalidate or writeback), you need to add the C64p Cache Module to your .cfg file. Or, you can simply right-click (and Use) the Cache module listed under: Available Products -> SYS/BIOS -> Target Specific Support -> C674 -> Cache (as shown in the discussion material).
Save the .cfg file. This SHOULD add the module to your outline view. When it shows up in the outline view, click on it. Do you see the MAR bits? The MAR region we are interested in for DDR2, by the way, is MAR 192-223. As a courtesy to users, the platform file already turned on the proper MAR bits for us for the DDR2 region. Check it out:
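The MAR numbers follow from the 16MB granularity mentioned in the quiz answers: the MAR index is the address divided by 16MB. A minimal sketch of that arithmetic, where mar_index is a hypothetical helper and the 0xC0000000 to 0xDFFFFFFF range is simply what MAR 192-223 works out to (check your device's memory map for the real DDR2 range):

```c
#include <stdint.h>

#define MAR_REGION_SHIFT 24u   /* 2^24 bytes = 16MB covered by each MAR */

/* Index of the MAR register that governs a given address. */
unsigned mar_index(uint32_t addr)
{
    return (unsigned)(addr >> MAR_REGION_SHIFT);
}
```

So an address of 0xC0000000 maps to MAR 192 and 0xDFFFFFFF maps to MAR 223.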
The good news is that we don't need to worry about the MAR bits for now.
10. Build, load, run using the Opt (duh) configuration. Run the program. View the CPU load graph and benchmark stat and write them down below:
With code/data external AND the cache ON, the benchmark should be close to 8K cycles, the SAME as running from internal IRAM (L2). In fact, what you're seeing is the L1D/P numbers. Why? Because L2 is cached in L1D/P, the closest memory to the CPU. This is what a cache does for you, especially with this architecture. Here's what the author got:
11. What about cache coherency? So, how does the audio sound with the buffers in DDR2 and the cache on? Shouldn't we be experiencing cache coherency problems with data in DDR2? Well, the audio sounds great, so why bother? Think about this for a while. What is your explanation as to why there are NO cache coherency problems in this lab?
Answer: _______________________________________________________________
12. Conclusion and Summary (long read, but worth it). It is amazing that you get the same benchmarks from all code/data in internal IRAM (L2) with L1 cache turned on as you do with code/data external and L2/L1 cache turned on. In fact, if you place the buffers DIRECTLY in L1D as SRAM, the benchmark is the same. How can this be? That's an efficient cache, eh? Just let the cache do its thing. Place your buffers in DDR2, turn on the cache and move on to more important jobs.
Here's another way to look at this. Cache is great for looping code (program, L1P) and sequentially accessed data (e.g. buffers). However, cache is not as effective at random access of variables. So, what would be a smart choice for part of L1D as SRAM? Coefficient tables, algorithm tables, globals and statics that are accessed frequently but randomly (not sequentially), and even frequently used ISRs (to avoid cache thrashing). The random data items would most likely fall into the .bss compiler section. Keep that in mind as you design your system. Let's look at the final results:
System                              Benchmark
Buffers in IRAM (internal)          8K cycles
All External (DDR2), cache OFF      ~4M
All External (DDR2), cache ON       8K cycles
Buffers in L1D SRAM                 7K cycles
So, will you experience the same results? 150x improvement with cache on and not much difference between internal memory only and external with cache on? Probably something similar. The point here is that turning the cache ON is a good idea. It works well and little thinking is required unless you have peripherals hooked to external memory (coherency). For what it is worth, you've seen the benefits in action and you know the issues and techniques that are involved. Mission accomplished.
RAISE YOUR HAND and get the instructor's attention when you have completed PART A of this lab. If time permits, move on to the next OPTIONAL part.
You're finished with this lab. If time permits, you may move on to additional optional steps on the following pages, if they exist.
Using EDMA3
Introduction
In this chapter, you will learn the basics of the EDMA3 peripheral. This transfer engine in the C64x+ architecture can perform a wide variety of tasks within your system, from memory-to-memory transfers to event synchronization with a peripheral and auto-sorting data into separate channels or buffers in memory. No programming is covered. For programming concepts, see ACPY3/DMAN3, LLD (Low Level Driver, covered in the Appendix) or CSL (Chip Support Library). Heck, you could even program it in assembly, but don't call ME for help.
Objectives
At the conclusion of this module, you should be able to:
Understand the basic terminology related to EDMA3
Describe how a transfer starts, how it is configured and what happens after the transfer completes
Understand how EDMA3 interrupts are generated
Easily read EDMA3 documentation and have a great context to work from to program the EDMA3 in your application
Module Topics
Module Topics
Using EDMA3 ....................................................................................................................... 15-1 Module Topics.................................................................................................................... 15-2 Overview............................................................................................................................ 15-3 What is a DMA ? .......................................................................................................... 15-3 Multiple DMAs.............................................................................................................. 15-4 EDMA3 in C64x+ Device ................................................................................................ 15-5 Terminology ....................................................................................................................... 15-6 Overview ........................................................................................................................ 15-6 Element, Frame, Block ACNT, BCNT, CCNT ............................................................... 15-7 Simple Example ............................................................................................................. 15-7 Channels and PARAM Sets ............................................................................................ 15-8 Examples ........................................................................................................................... 15-9 Synchronization ............................................................................................................... 15-12 Indexing ........................................................................................................................... 15-13 Events Transfers Actions ............................................................................................ 
15-15 Overview ...................................................................................................................... 15-15 Triggers ........................................................................................................................ 15-16 Actions Transfer Complete Code ............................................................................... 15-16 EDMA Interrupt Generation .............................................................................................. 15-17 Linking ............................................................................................................................. 15-18 Chaining .......................................................................................................................... 15-19 Channel Sorting ............................................................................................................... 15-21 Architecture & Optimization .............................................................................................. 15-22 Programming EDMA3 Using Low Level Driver (LLD) ..................................................... 15-23 Chapter Quiz.................................................................................................................... 15-25 Quiz Answers ............................................................................................................ 15-26 Additional Information....................................................................................................... 15-27 Notes ............................................................................................................................... 15-30
Overview
Multiple DMAs: EDMA3 and QDMA
[Figure: block diagram with labels VPSS, Master Periph, C64x+ DSP (L1P, L1D, L2), and EDMA3 (System DMA) containing DMA (sync) and QDMA (async)]
DMA
- Enhanced DMA (version 3)
- DMA to/from peripherals
- Can be synchronized to peripheral events
- Handles up to 64 events
QDMA
- Quick DMA
- DMA between memory
- Async: must be started by the CPU
- 4-16 channels available
Both share (number depends upon the specific device)
- 128-256 Parameter RAM sets (PARAMs)
- 64 transfer complete flags
- 2-4 pending transfer queues
EDMA3 in C64x+ Device
[Figure: device block diagram. Labels: EDMA3 (CC, TC0, TC1, TC2), Master and Slave peripherals (McASP, McBSP, PCI), ARM, L3, AET, CPU, IDMA, L1D, L1D Mem Ctrl, L2, L2 Mem Ctrl, External Mem Cntl, DATA SCR, CFG SCR, with bus-width labels 32 and 128]
- EDMA3 is a master on the DATA SCR: it can initiate data transfers.
- EDMA3's configuration registers are accessed via the CFG SCR (by the CPU).
- Each TC has its own connection (and priority) to the DATA SCR. Refer to the connection matrix to determine valid connections.
Terminology
Overview
Element, Frame, Block - ACNT, BCNT, CCNT
[Figure: a block is made of frames 1..M (C Count = # Frames); each frame is made of elements 1..N (B Count = # Elements). The Transfer Configuration set contains the fields Options, Source, Transfer Count (B, A), Destination, Index, Cnt Reload, Link Addr, Index, Index, Rsvd and C; paired fields occupy bits 31-16 and 15-0.]
Simple Example
Examples
Synchronization
Indexing
Triggers
Linking
Chaining
Channel Sorting
EDMA Architecture
[Figure: EDMA architecture. Labels: Periphs (E0, E1, ...), Evt Reg (ER), Evt Enable Reg (EER), Evt Set Reg (ESR), Chain Evt Reg (CER), Queue (Q0, Q1, Q2, Q3), PSET 0, PSET 1, ..., CC, TR Submit, TC (TC0, TC1, TC2, TC3), DATA SCR, Early TCC, Normal TCC, Completion Detection. SCR = Switched Central Resource.]
- EDMA consists of two parts: the Channel Controller (CC) and the Transfer Controller (TC).
- An event (from a peripheral via ER/EER, manually via ESR, or via chaining through CER) sends the transfer to 1 of 4 queues (Q0 is mapped to TC0, Q1 to TC1, etc. Note: McBSP can use TC1 only).
- The transfer is mapped to 1 of 256 PSETs and submitted to the TC (1 TR, or transfer request, per ACNT bytes or per ACNT * BCNT bytes, depending on the sync type). Note: the destination FIFO allows buffering of writes while more reads occur.
- The TC performs the transfer (read/write) and then sends back a transfer completion code (TCC).
- The EDMA can then interrupt the CPU and/or trigger another transfer (chaining, Chap 6).
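The "one TR per ACNT bytes or per ACNT * BCNT bytes" rule can be sketched in a few lines of C. This is only a host-side illustration: the type and function names below are made up for this example and are not part of any TI driver, and the two sync types are assumed to be the usual A-synchronized and AB-synchronized modes.

```c
#include <stdint.h>

/* Hypothetical names for illustration only (not a TI API). */
typedef enum { SYNC_A, SYNC_AB } sync_type_t;

/* Bytes moved by a single transfer request (TR):
 * A-sync moves one array (ACNT bytes),
 * AB-sync moves one frame (ACNT * BCNT bytes). */
uint32_t bytes_per_tr(sync_type_t sync, uint32_t acnt, uint32_t bcnt)
{
    return (sync == SYNC_A) ? acnt : acnt * bcnt;
}

/* Number of sync events (and TRs) needed to move the whole block:
 * A-sync needs one per array, AB-sync needs one per frame. */
uint32_t trs_per_block(sync_type_t sync, uint32_t bcnt, uint32_t ccnt)
{
    return (sync == SYNC_A) ? bcnt * ccnt : ccnt;
}
```

With ACNT = 2, BCNT = 2, CCNT = 4, an A-sync transfer would need 8 events of 2 bytes each, while an AB-sync transfer would need 4 events of 4 bytes each.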
Programming EDMA3 Using Low Level Driver (LLD) *** this page used to have very valuable information on it ***
Chapter Quiz
1. Name the 4 ways to trigger a transfer?
3. Fill out the following values for this channel sorting example (5 min):
16-bit stereo audio (interleaved). Use EDMA to auto channel sort to memory.
PERIPH: L0 R0 L1 R1 L2 R2 L3 R3
MEM (BUFSIZE): L0 L1 L2 L3 R0 R1 R2 R3
ACNT: _____ BCNT: _____ CCNT: _____ BIDX: _____ CIDX: _____
Could you calculate these?
Quiz Answers
1. Name the 4 ways to trigger a transfer?
3. Fill out the following values for this channel sorting example (5 min):
16-bit stereo audio (interleaved). Use EDMA to auto channel sort to memory.
PERIPH: L0 R0 L1 R1 L2 R2 L3 R3
MEM (BUFSIZE): L0 L1 L2 L3 R0 R1 R2 R3
ACNT: 2 BCNT: 2 CCNT: 4 BIDX: 8 CIDX: -6
Could you calculate these?
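To check the channel-sorting answers, here is a small host-side model of the transfer. It is a sketch under stated assumptions, not EDMA driver code: it assumes an A-synchronized transfer, a source that sits at a fixed peripheral address (modeled here as a sample stream), and that BIDX and CIDX in the quiz are the destination indexes in bytes. The function name is made up for the example.

```c
#include <stdint.h>
#include <string.h>

/* Quiz answers: 2-byte samples, 2 channels per frame, 4 frames. */
enum { ACNT = 2, BCNT = 2, CCNT = 4, DST_BIDX = 8, DST_CIDX = -6 };

/* Model of an A-synchronized transfer. 'stream' stands in for the
 * samples the peripheral delivers in order (L0, R0, L1, R1, ...). */
void edma_channel_sort(const int16_t *stream, uint8_t *dst_base)
{
    int32_t dst_off = 0;  /* byte offset into the destination buffer */
    int n = 0;            /* next sample delivered by the peripheral */

    for (int c = 0; c < CCNT; c++) {
        for (int b = 0; b < BCNT; b++) {
            memcpy(dst_base + dst_off, &stream[n++], ACNT);
            /* BIDX applies between arrays inside a frame;
             * CIDX applies after the last array of the frame. */
            dst_off += (b < BCNT - 1) ? DST_BIDX : DST_CIDX;
        }
    }
}
```

Walking the offsets by hand gives 0, 8, 2, 10, 4, 12, 6, 14, which places the left samples in the first half of the buffer and the right samples in the second half.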
Additional Information
Notes
Topic Choices
Intro to DSP/BIOS
Introduction
This chapter introduces the general nature of real-time systems and the DSP/BIOS operating system. Each of the concepts noted here will be studied in greater depth in succeeding chapters.
Objectives
- Describe how to create a new BIOS project
- Learn how to configure BIOS using TCF files
- Lab 16a - Create and debug a simple DSP/BIOS application
Module Topics
Intro to DSP/BIOS
  Module Topics
  DSP/BIOS Overview
  Threads and Scheduling
  Real-Time Analysis Tools
  DSP/BIOS Configuration - Using TCF Files
  Creating A DSP/BIOS Project
  Memory Management - Using the TCF File
  Lab 16a: Intro to DSP/BIOS
    Lab 16a - Procedure
      Create a New Project
      Add a New TCF File and Modify the Settings
      Build, Load, Play, Verify
      Benchmark and Use Runtime Object Viewer (ROV)
  Additional Information & Notes
  Notes
DSP/BIOS Overview
Remember?
Sections: .text, .bss, .far, .cinit, .cio, .stack
Memory Segments:
  IRAM   1180_0000  256K
  FLASH  6400_0000  4MB
  DDR2   C000_0000  512MB
How do you define the memory segments (e.g. IRAM, FLASH, DDR2)? How do you place the sections into these memory segments?
How do we accomplish this with a .tcf file?
Application: blink USER LED_1 on the EVM every second
Key Ideas: main() returns to BIOS scheduler, IDL fxn runs to blink LED
What will you learn? .tcf file mgmt, IDL fxn creation/use, creation of BIOS project, benchmarking code, ROV
Pseudo Code:
  main() - init BSL, init LED, return to BIOS scheduler
  ledToggle() - IDL fxn that toggles LED_1 on EVM
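The pseudo code above might look roughly like this in C. This is a host-side sketch, not the lab's led.c or main.c: the board calls are replaced by hypothetical stubs, lab_main() stands in for the body of main(), idle_loop_stub() stands in for the BIOS idle loop, and the one-second delay is left out.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the board support (BSL) calls. */
bool led1_on = false;
int  toggle_count = 0;

void board_init_stub(void)     { led1_on = false; toggle_count = 0; }
void led1_write_stub(bool on)  { led1_on = on; }

/* IDL fxn: toggles LED_1 each time the idle loop calls it. */
void ledToggle(void)
{
    led1_write_stub(!led1_on);
    toggle_count++;
}

/* Stand-in for main(): initialize, then RETURN so the
 * scheduler (here: idle_loop_stub) gets a chance to run. */
int lab_main(void)
{
    board_init_stub();
    return 0;
}

/* Stand-in for the BIOS idle loop: calls the IDL fxn repeatedly. */
void idle_loop_stub(void (*idl_fxn)(void), int passes)
{
    for (int i = 0; i < passes; i++) {
        idl_fxn();
    }
}
```

The point of the structure is that nothing in lab_main() calls ledToggle() directly; the toggling only happens once control has returned to whatever runs the idle loop.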
2. Choose a Project template. This screen was brand new in CCSv4.2.2, and it is not intuitive to the casual observer that the Next button above even exists: you see Finish, you click it. Ah, but the hidden secret is the Next button. The CCS developers are actually trying to do us a favor, IF you understand what a BIOS template is.
As you can see, there are many choices. Empty Projects are just that: empty, with just a path to the include files for the selected processor. Go ahead and click on Basic Examples to see what's inside. Click on all the other + signs to see what they contain. Ok, enough playing around. We are using BIOS 5.41.xx.xx in this workshop. So, the correct + sign to choose in the end is the one that is highlighted above.
3. Choose the specific BIOS template for this workshop. Next, you'll see the following screen:
Select Empty Example. This will give us the paths to the BIOS include directories. The other examples contain example code and .tcf files. NOW you can click Finish.
4. Add files to your project. From the lab's \Files directory, ADD the following files: led.c, main.c, main.h. Open each and inspect them. They should be pretty self-explanatory.
5. Link the LogicPD BSL library to your project as before.
6. Add an include path for the BSL library \inc directory. Right-click on the project and select Build Properties. Select C6000 Compiler, then Include Options (you've done this before). Add the proper path for the BSL include dir (else you will get errors when you build).
At this point in time, what files are we missing? There are 3 of them. Can you name them? ______________ ______________ ______________
Check the box that says "create a heap in this memory" (if not already checked) and change the heap size to 4000h. Click OK. Now that we HAVE a heap in IRAM (that's another name for L2, by the way), we need to tell the mother ship (MEM) where our heap is. Right-click on MEM and select Properties. Click on both down arrows and select IRAM for both (again, this is probably already done for you). Click OK. Now she's happy.
Save the TCF file.
Note: FYI - throughout the labs, we will throw in the top 10 or 20 tips that cause Debug nightmares during development. Here's your first one:
Hint: TIP #1 - Always create a HEAP when working with BIOS projects.
Hint: TIP #2 - Always #include the cfg.h file in your application code when using BIOS, as the FIRST included header file.
11. Inspect the generated files resulting from our new TCF file. In the project view, locate the following files and inspect them (actually, you'll need to BUILD the project before these show up): bios_ledcfg.h, bios_ledcfg.cmd
There are other files that get generated by the existence of the .tcf file, which we will cover in later labs. The .cmd file is automatically added to your project as a source file. However, your code must #include the cfg.h file or the compiler will think all the BIOS stuff is declared implicitly.
12. Debug and Play your code. Click the Debug "Bug" (this is equivalent to Debug Active Project). Remember, this code blinks LED_1 near the bottom of the board. When you Play your code and the LED blinks, you're done. When the execution arrow reaches main(), hit Play. Does the LED blink? No? What is going on? Think back to the scheduling diagram and our discussions. To turn BIOS ON, what is the most important requirement? main() must RETURN or fall out via a brace }. Check main.c and see if this is true. Many users still have while() loops in their code and wonder why BIOS isn't working. If you never return from main(), BIOS will never run.
Hint: TIP #3 - BIOS will NOT run if you don't exit main().
Ok, so no funny tricks there, that checks out. Next question: how is the function ledToggle() getting called? Was it called in main()? Hmmm. Who is supposed to call ledToggle()? When your code returns from main(), where does it go? The BIOS scheduler. And, according to our scheduling diagram and the threads we have in the system, which THREAD will the scheduler run when it returns from main()? Can you explain what needs to be done? ________________________________________
13. Add IDL object to your TCF. The answer is: the scheduler will run the IDL thread when nothing else exists. All other thread types are higher priority. So, how do you make the IDL thread call ledToggle()? Simple. Add an IDL object and point it to our function. Open the TCF file and click on Scheduling. Right-click on IDL and select Insert IDL. Name the IDL Object IDL_ledToggle. Now that we have the object, we need to tell the object what to do, i.e. which fxn to run. Right-click on IDL_ledToggle and select Properties. You'll notice a spot to type in the function name. Ok, make room for another important tip. BIOS is written in ASSEMBLY. The ledToggle() function is written in C. How does the compiler distinguish between an assembly label or symbol and a C label? The magic underscore "_". All C symbols and labels (from an assembly point of view) are preceded with an underscore.
Hint: TIP #4 - When entering a fxn name into BIOS objects, precede the name with an underscore "_". Otherwise you will get a symbol referencing error which is difficult to locate.
SO, the fxn name you type in here must be preceded by an underscore:
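As a tiny illustration of TIP #4, the helper below builds the name you would type into the BIOS object from the C function name. The helper itself is made up for this example; it only shows the naming rule and is not something the tools require you to write.

```c
#include <stdio.h>
#include <stddef.h>

/* Builds the assembly-level symbol for a C function name by
 * prepending an underscore. Returns 0 on success, -1 if 'out'
 * is too small to hold the result. */
int c_to_asm_symbol(const char *c_name, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "_%s", c_name);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

For the function in this lab, ledToggle becomes _ledToggle.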
You have now created an IDL object that is associated with a fxn. By the way, when you create HWI, SWI and TSK objects later on, guess what? It is the SAME procedure. You'll get sick of this by the end of the week: right-click, insert, rename, right-click and select Properties, type some stuff. There, that is DSP/BIOS in a nutshell.
14. Build and Debug AGAIN. When the execution arrow hits main(), click Play. You should now see the LED blinking. If you ever HALT/PAUSE, it will probably pause inside a library fxn that has no source associated with it. Just X that thing. At this point, your first BIOS project is working. Do NOT "terminate all" yet. Simply click on the C/C++ perspective and move on to a few more goodies.
Don't type in the call to LOG_printf() just yet. We'll do that in a few moments.
16. Build, Debug, Play. When finished, build your project; it should auto-download to the EVM. Switch to the Debug perspective and set a breakpoint as shown in the previous diagram. Click Play. When the code stops at your breakpoint, select View → Local. Here's the picture of what that will look like:
Are you serious? 1.57M CPU cycles. Of course. This mostly has to do with going through I2C and a PLD and waiting forever for acknowledge signals (can anyone say BUS HOLD?). Also, don't forget we're using the Debug build configuration with no optimization. More on that later. Nonetheless, we have our benchmark.
17. Open up TWO .tcf files: is this a problem? The author has found a major "uh oh" that you need to be aware of. Open your .tcf file and keep it open. Double-click on the project's TCF file AGAIN. Another instance of this window opens. Nuts. If you change one and save the other, what happens? Oops. So, we recommend you NOT minimize TCF windows and then forget you already have one open and open another. Just BEWARE.
18. Add LOG Object and LOG_printf() API to display benchmark. Open led.c for editing and add the LOG_printf() statement as shown in a previous diagram. Open the TCF for editing. Under Instrumentation, add a new LOG object named trace. Remember? Right-click on LOG, insert log, rename to trace, click OK.
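To put the 1.57M-cycle benchmark in perspective, a cycle count can be converted to time once the CPU clock is known. The sketch below is a plain host-side helper, not part of the lab code; the 300 MHz figure used in the note after it is an assumption, so check your own PLL settings.

```c
#include <stdint.h>

/* Convert a CPU cycle count to microseconds for a given clock in Hz. */
double cycles_to_us(uint64_t cycles, double cpu_hz)
{
    return (double)cycles * 1e6 / cpu_hz;
}
```

If the CPU were running at 300 MHz, 1.57M cycles would work out to roughly 5.2 ms per call.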
19. Pop over to Windows Explorer and analyze the \Project folder. Remember when we said that another folder would be created if you were using BIOS? It was called .gconf. This is the GRAPHICAL config tool in action that is fed by the .cdb file. When you add a .tcf file, the graphical and textual tools must both exist and follow each other. Go check it out. Is it there? Ok, back to the action.
20. Build, Debug, Play - use ROV. When the code loads, remove the breakpoint in led.c. Then, click Play. PAUSE the execution after about 5 seconds. Open the ROV tool via Tools → ROV. When ROV opens, select LOG and one of the sequence numbers like 2 or 3:
Notice the result of the LOG_printf() under "message". You can choose other sequence numbers and see what their times were. You can also choose to see the LOG messages via Tools → RTA → Printf Logs. Try that now and see what you get. If you'd like to change the behavior of the LOGging, go back to the LOG object and try a bigger buffer, circular (last N samples) or fixed (first N samples). Experiment away. When we move on to a TSK-based system, the ROV will come in very handy. This tool actually replaced the older KOV (kernel object viewer) in the previous CCS. Also, in future labs, we'll use the RTA (Real-time Analysis) tools to view Printf logs directly. By then, you'll know two different ways to access debug info.
Note: Explain this to me: the tool is called ROV, which stands for RUNTIME Object Viewer. But the only way to VIEW the OBJECT is in STOP time. Hmmm. Marketing? Illegal drug use? Ok, so it collects the data during runtime, but still, to the author, this is a stretch and confuses new users. Ah, but now you know the rest of the story.
Terminate the Debug Session and close the project. You're finished with this lab. Please raise your hand and let the instructor know you are finished with this lab (maybe throw something heavy at them to get their attention or say "CCS crashed AGAIN!", that will get them running).
Notes
Objectives
- Compare/contrast the startup events of CCS (GEL) vs. booting from flash
- ... SPI Flash Writer utilities to create and burn a flash boot image
- ... a bootable flash image, POR, run
Module Topics
Booting From Flash
  Module Topics
  Booting From Flash
    Boot Modes - Overview
    System Startup
    Init Files
    AISgen Conversion
    Build Process
    SPIWriter Utility (Flash Programmer)
    ARM + DSP Boot
    Additional Info
    C6748 Boot Modes (S7, DIP_x)
  Lab 16b: Booting From Flash
    Lab 16b - Booting From Flash - Procedure
      Tools Download and Setup (Students: SKIP STEPS 1-6 !!)
      Build Keystone Project: [Src → .OUT File]
      Use AISgen To Convert [.OUT → .BIN]
      Program the Flash: [.BIN → SPI1 Flash]
      Optional - DDR Usage
  Additional Information
  Notes
System Startup
Init Files
AISgen Conversion
Build Process
Additional Info
[Figure: SW7 DIP switch positions (switches 8 through 1, ON/OFF) for EMU MODE and for SPI BOOT]
Booting From Flash *** this page was accidentally created by a virus please ignore ***
2. Create directories to hold tools and projects. Four directories need to be created:
C:\TI-RTOS\C6000\Labs\Lab16b_keystone - will contain the audio project (keystone) to build into a .OUT file.
C:\TI-RTOS\C6000\Labs\Lab13b_ARM_Boot - will contain the ARM boot code required to start up the DSP after booting.
C:\TI-RTOS\C6000\Labs\Lab13b_SPIWriter - will contain the SPIWriter.out file used to program the flash on the EVM.
C:\TI-RTOS\C6000\Labs\Lab13b_AIS - contains the AISgen.exe file (shown above) and is where the resulting AIS script (bin) will be located after running the utility (.OUT → .BIN).
Place the keystone files into the \Lab16b_keystone\Files directory. Users will build a new project to get their .OUT file. Place the recently downloaded AISgen.exe file into the \Lab16b_AIS directory.
3. Download SPI Flash Utilities. You can find the SPI Flash Utility here: http://processors.wiki.ti.com/index.php/Serial_Boot_and_Flash_Loading_Utility_for_OMAP-L138 This is actually a TI wiki page:
From here, locate the following and click "here" to go to the download page:
This will take you to a SourceForge site that will contain the tools you need to download.
Click on the latest version under OMAP-L138 and download the tar.gz file. UnTAR the contents and you'll see this:
The path we need is \OMAP-L138. If we dive down a bit, we will find the SPIWriter.out file that is used to program the flash with our boot image (.bin).
4. Copy the SPIWriter.out file to the \Lab13b_SPIWriter\ directory. Shown below are the initial contents of the Flash Utility download:
Copy the following file to the \Lab13b_SPIWriter\ directory: SPIWriter_OMAP-L138.out
5. Install AISgen. Find the download of the AISgen.exe file and double-click it to install. After installation, copy a shortcut to the desktop for this program:
6. Create the keystone project. Create a new CCSv5 SYS/BIOS project with the source files listed in C:\SYSBIOSv4\Lab13b_keystone\Files. Create this project in the neighboring \Project folder. Also, don't forget to add the BSL library and BSL includes (as normal). Make sure you use the RELEASE configuration only.
8. Set address of reset vector for DSP. Here is one of the tricks that must be employed when using both the ARM and DSP. The ARM code has to know the entry point (reset vector, c_int00) of the DSP. Well, if you just compile and link, it could go anywhere in L2. If your class is based on SYS/BIOS, please follow those instructions. If you're interested in how this is done with DSP/BIOS, that solution is also provided for your reference.
SYS/BIOS Users must add two lines of script code to the CFG file as shown. This script forces the reset vector address for the DSP to 0x11830000. Locate this in the given .cfg file and UNCOMMENT these two lines of code.
DSP/BIOS Users must create a linker.cmd file as shown below to force the address of the reset vector. This little command file specifies EXACTLY where the .boot section should go for a BIOS project (this is not necessary for a non-BIOS program).
9. Examine the platform file. In the previous step, we told the tools to place the DSP reset vector specifically at address 0x11830000. This is the upper 64K of the 256K block of L2 RAM. One of our labs in the workshop specified L2 cache as 64K. Guess what? If that setting is still true, L2 cache effectively starts at the same address, which means that this address is NOT available for the reset vector. WHOOPS. Select Build Options and determine WHICH platform file is associated with this project. Once you have determined which platform it is, open it and examine it. Make sure L2 cache is turned off or ZERO and that all code/data/stack segments are allocated in IRAM. If this is not true, then make it so.
10. Build the keystone project. Update all tools for XDC, BIOS, UIA. Kill Agent. Update Compiler. Basically, update everything to your latest tool set to get rid of errors and warnings. Using the DEBUG build configuration, build the project. This should create the .OUT file. Go check the \Debug directory and locate the .OUT file: keystone_flash.out
Load the .OUT file and make sure it executes properly. We don't want to flash something that isn't working. Do not close the Debug session yet.
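The address clash described in step 9 is simple arithmetic, sketched below. The L2 base address (0x11800000) and size (256K) are the IRAM values used in this workshop, and the sketch assumes, as the step describes, that L2 cache is taken from the top of L2. The function names are made up for this example.

```c
#include <stdint.h>
#include <stdbool.h>

#define L2_BASE    0x11800000u       /* start of the 256K L2 (IRAM) block */
#define L2_SIZE    (256u * 1024u)
#define RESET_VEC  0x11830000u       /* DSP reset vector forced in step 8 */

/* First address occupied by L2 cache of the given size
 * (assumes cache is carved from the top of L2). */
uint32_t l2_cache_start(uint32_t cache_bytes)
{
    return L2_BASE + L2_SIZE - cache_bytes;
}

/* True if the reset vector address itself lies below the cache region.
 * Only the vector address is checked, not the code that follows it. */
bool reset_vector_is_usable(uint32_t cache_bytes)
{
    return RESET_VEC < l2_cache_start(cache_bytes);
}
```

With 64K of L2 cache the cache region begins at 0x11830000, exactly on top of the reset vector; with cache set to zero the address is free.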
11. Determine silicon rev of the device you are currently using. AISgen will want to know which silicon rev you are using. Well, you can either attempt to read it off the device itself (which is nearly impossible) or you can visit a convenient place in memory to see it. Now that you have the Debug perspective open, this should be relatively straightforward. Open a memory view window and type in the following address: 0x11700000
Can you see it? No? Shame on you. Ok. Try changing the style view to Character instead. See something different? Like this?
That says d800k002, which the list below maps to Rev 1.0 silicon. That's an older rev, but whatever yours is, write it down below:
Silicon REV: ____________________
FYI - for OMAP-L138 (and C6748), note the following:
d800k002 = Rev 1.0 silicon (common, but old)
d800k004 = Rev 1.1 silicon (fairly common)
d800k006 = Rev 2.0 silicon (if you have a newer board, this is the latest)
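The ID-to-revision list above is easy to capture as a lookup table. The sketch below is a host-side helper with a made-up function name; it only knows the three IDs listed in this lab and returns NULL for anything else.

```c
#include <string.h>
#include <stddef.h>

/* Maps the ID string seen at 0x11700000 to a silicon revision,
 * using only the three entries listed in this lab.
 * Returns NULL for an ID that is not in the list. */
const char *silicon_rev_from_rom_id(const char *rom_id)
{
    static const struct { const char *id; const char *rev; } table[] = {
        { "d800k002", "1.0" },
        { "d800k004", "1.1" },
        { "d800k006", "2.0" },
    };

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(rom_id, table[i].id) == 0) {
            return table[i].rev;
        }
    }
    return NULL;
}
```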
There ARE some differences between Rev1 and Rev2 silicon that we'll mention later in this lab. They are very important in terms of how the ARM code is written. You will probably NEVER need to change the memory view to Character ever again, so enjoy the moment. Next, we need to convert this .out file, combine it with the ARM .out file, and create a single flash image for both using the AIS script via AISgen.
12. Use the Debug GEL script to locate the Silicon Rev. This script can be run at any time to debug the state of your silicon and all of the important registers and frequencies your device is running at. This file works for both OMAP-L137/8 and C6747/8 devices. It is a great script to provide feedback for your hardware engineer. It goes kind of like this: we want a certain frequency for PLL1. We read the documentation and determine that these registers need to be programmed to a, b and c. You write the code, program them and then build/run. Well, is PLL1 set to the frequency you thought it should be? Run the debug script and find out what the processor is reporting the setting is. Nice. This script outputs its results to the Console window. Let's use the debug script to determine the silicon rev as in the previous step. First, we need to LOAD the GEL file. This file can be downloaded from the wiki shown in the chapter. We have already done that for you and placed that GEL file in the \gel directory next to the GEL file you've been using for CCS. Select Tools → GEL Files. Right-click in the empty area under the currently loaded GEL file and select Load GEL.
The \gel directory should show up and the file OMAPL1x_debug.gel should be listed. If not, browse to C:\SYSBIOSv4\Labs\DSP_BSL\gel.
Click Open.
This will load the new GEL file and place the scripts under the Scripts menu. Select Scripts → Diagnostics → Run All:
You can choose to run only a specific script or all of them. Notice the output in the Console window. Scroll up and find the silicon revision. Also make note of all of the registers and settings this GEL file reports. Quite extensive.
Does your report show the same rev as you found in the previous step? Let's hope so. Write down the Si Rev again here: Silicon Rev (again): ______________________
This is the INSTALL file (FYI). You don't need to use this if the tool is already installed on your computer.
14. Run AISgen. There should be an icon on your desktop that looks like this:
If not, you will need to install the tool by double-clicking on the install file, installing it and then creating a shortcut to it on the desktop (you'll find it in Programs → Texas Instruments → AISgen). Double-click on the icon to launch AISgen and fill out the dialogue box as shown on the next page. There are several settings you need, so be careful and go SLOWLY here. It is usually BEST to place all of your PLL and DDR settings in the flash image and have the bootloader set these up vs. running code on the DSP to do it. Why? Because the DSP then comes out of reset READY to go at the top speeds vs. running slow until your code in main() is run. So, that's what we plan to do.
Note: Each dialogue has its own section below. It is quite a bit of setup, but hey, you are enabling the bootloader to set up your entire system. This is good stuff, but it takes some work.
Hint: When you actually use the DSP to burn the flash in a later step, the location you store your .bin file to (name of the .bin file AND the directory path you place the .bin file in) CANNOT have ANY SPACES IN THE PATH OR FILENAME.
Note: you will type in these paths in a future step. Do NOT do it now.
Then click on the PLL0 tab and view these settings. You will see the defaults show up. Make the following modifications as shown below. Change the multiplier value from 20 to 25 and notice the values in the bottom RH corner change.
Peripheral Tab
Next, click on the Peripheral tab. This is where you will set the SPI Clock. It is a function (divide down) of the CPU clock. If you leave it at 1MHz, well, it will work, but the bootload will take WAY longer. So, this is a speed-up enhancement. Type 20 into the SPI Clock field as shown:
Also check the Enable Sequential Read checkbox. Why is this important? Speed of the boot load. If this box is unchecked, the ROM code will send out a read command (0x03) plus a 24-bit address before every single BYTE. That is a TON of read commands. However, if we CHECK this box, the ROM code will send out a single 24-bit address (0x000000) and then proceed to read out the ENTIRE boot image. WAY WAY faster.
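To see why sequential read matters, the sketch below counts the bytes clocked over SPI in each mode. It is a back-of-the-envelope model only: it assumes one command byte plus three address bytes per read command (the 0x03 command and 24-bit address described above) and ignores chip-select gaps and other overhead. The function names are made up for this example.

```c
#include <stdint.h>

/* Box unchecked: command (1) + address (3) + data (1) for EVERY byte. */
uint64_t spi_bytes_per_byte_reads(uint64_t image_bytes)
{
    return image_bytes * (1u + 3u + 1u);
}

/* Box checked: one command + address, then the whole image streams out. */
uint64_t spi_bytes_sequential(uint64_t image_bytes)
{
    return (1u + 3u) + image_bytes;
}
```

Under these assumptions, the unchecked mode moves about five times as many bytes over the bus as the sequential mode for any image of realistic size.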
Configure PLL1
Just in case you EVER want to put code or data into the DDR, PLL1 needs to be set in the flash image and therefore configured by the bootloader. So, click the checkbox next to Configure PLL1, click on that tab, and use the following settings:
This will clock the DDR at 300MHz. This is equivalent to what our GEL file sets the DDR frequency to. We don't have any code in DDR at the moment, but now we have it set up just in case we ever do later on. Now, we need to write values to the DDR config registers.
Configure DDR
You know the drill. Click the proper checkbox on the main dialogue page and click on the DDR tab. Fill in the following values as shown. If you want to know what each of the values on the right means, look it up in the datasheet.
This would enable module 15 of the PSC, which says "de-assert the reset on the DSP megamodule" and enable the clocks so that the ARM can write to the DSP memory located in L2. However, this setting does NOT match what the GEL file did for us. So, we need to enable MORE of the PSC modules so that we match the GEL file.
Note: When doing this for your own system, you'll need to pick and choose the PSC modules that are important to your specific system.
Better Setting (USE THIS ONE for the lab or as a starting point for your own system)
The numbers scroll out of sight, so here are the values:
PSC0: 0;1;2;3;4;5;9;10;11;12;13;15
PSC1: 0;1;2;3;4;5;6;7;9;10;11;12;13;14;15;16;17;18;19;20;21;24;25;26;27;28;29;30;31
Note: PSC1 is MISSING modules 8, 22 and 23 (see datasheet for more details on these).
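Since the module lists scroll out of sight in the tool, it can help to double-check them in code. The sketch below parses a semicolon-separated list like the ones above into a bitmask (bit n set means module n is listed). It is a host-side helper with a made-up name and has nothing to do with how AISgen stores these values internally.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parses "0;1;2;..." into a bitmask: bit n set = module n is listed.
 * Module numbers of 32 or more are ignored; parsing stops at the
 * first character that is not part of a number or a ';'. */
uint32_t psc_list_to_mask(const char *list)
{
    uint32_t mask = 0;
    const char *p = list;

    while (*p != '\0') {
        char *end;
        unsigned long n = strtoul(p, &end, 10);
        if (end == p) {
            break;                      /* no digits found: stop */
        }
        if (n < 32) {
            mask |= (uint32_t)1 << n;
        }
        p = (*end == ';') ? end + 1 : end;
    }
    return mask;
}
```

Inverting the PSC1 mask should leave exactly bits 8, 22 and 23 set, which matches the note above.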
Save your .cfg file in the \Lab13b_AIS folder for potential use later on. You don't want to have to re-create all of these steps again if you can avoid it. If you look in that folder, it already contains this .cfg file done for you. Ok, so we could have told you that earlier, but then the learning would have been crippled. The author named the solution's config file: OMAP-L138-ARM-DSP-LAB13B_TTO.cfg
Hint: C6748 Users: You will only specify ONE output file (DSP.out).
Hint: OMAP-L138 Users: You will specify TWO files (an ARM.out and a DSP.out).
For the DSP Application File, browse to the .OUT file that was created when you built your keystone project: keystone_flash.out
Hint:
For OMAP-L138 users: you will enter the paths to both files and AISgen will combine them intoONE image (.bin) to burn into the flash. You must FIRST specify the ARM.out file followed by the DSP.out file this order MATTERS. Follow these steps in order carefully.
Click the button shown above next to ARM Application File to browse to (use \Lab13b instead):
Click Open. Your screen should now look like this (except for using \Lab13b):
This ARM code is for Rev1 silicon. It should also work on Rev2 silicon, but this has not been tested.
Next, click on the + sign (yours will say \Lab13b):
and browse to the keystone_flash.out file you built earlier. You should now have two .out files listed under ARM Application File: first the ARM.out, then the DSP.out, separated by a semicolon. Double-check this is the case. The AISgen software won't allow you to see both paths at once in that tiny box, but here is a picture of the middle of the path showing the semicolon between the two .out files. Again, the ARM.out file needs to be first, followed by the DSP.out file (use \Lab13b instead):
Hint:
For the Output file, name it flash.bin and use the following path: C:\SYSBIOSv4\Labs\Lab13b_AIS\flash.bin
Hint:
Again, the path and filename CANNOT contain any spaces. When you run the flash writer later on, that program will barf on the file if there are any spaces in the path or filename.
Before you click the Generate AIS button, notice the other configuration options you have here. If you wanted AIS to write the code to configure any of these options, simply check them and fill out the info on the proper tab. This is a WAY cool interface. And, the bootloader does system setup for you, instead of you writing code to do it, making mistakes, debugging those mistakes, and getting frustrated... like getting tired of reading this rambling text from the author.
15. Generate AIS script (flash.bin). Click the Generate AIS button. When complete, it will provide a little feedback as to how many bytes were written. Like this:
So, what did you just do? For OMAP-L138 (ARM+DSP) users, you just combined the ARM.out and DSP.out files into one flash image: flash.bin. For C6748 users, you simply converted your .out file to a flash image. The next step is to burn the flash with this image and then let the bootloader do its thing.
18. PLAY! Click Play. The console window will pop up and ask whether this is a UBL image. The answer is NO. The answer would be yes only if you were using a TI UBL, which would then boot U-Boot; that assumes Linux is running, and our ARM code has no O/S. Type a lowercase n and hit [ENTER]. To respond to the next question, provide the path name for your .BIN file (flash.bin) created in a previous step, i.e.: C:\SYSBIOSv4\Labs\Lab13b_AIS\flash.bin
Hint: Do NOT have any spaces in this path name; SPIWriter will not work that way.
Here's a screen capture from the author (although you are using the \Lab13b_ais dir, not \Lab12b):
Let it run; it shouldn't take too long, about 15-20 seconds (with an XDS510 emulator). You will see some progress messages and then success, like this:
20. Ensure DIP switches are set correctly and get music playing, then power-cycle! Make sure ALL DIP switches on S7 are DOWN [OFF]. This will place the EVM into the SPI-1 boot mode. Get some music playing. Power cycle the board and THERE IT GOES. No need to re-flash anything like a POST; just leave your neat little program in there for some unsuspecting person to stumble on one day when they forget to set the DIP switches back to EMU mode and they automagically hear audio coming out of the speakers when they turn on the power. Freaky. You should see the LED blinking as well. Great work!!
Hint: DO NOT SKIP THE FOLLOWING STEP.
21. Change the boot mode pins on the EVM back to their original state. Please ensure DIP_5 and DIP_8 of S7 (the one on the right) are UP [ON]. RAISE YOUR HAND and get the instructor's attention when you have completed this lab. If time permits, move on to the next OPTIONAL part.
When running AISgen, you can simply load this config file; it contains ALL of the settings from this lab. Edit, recompile, load this cfg, generate .bin, burn, reset. Quick. Or, you can simply use the .cfg file you saved earlier in this lab.
Additional Information
Notes
Objectives
the key APIs used
Adapt a TSK to use SIO (Stream I/O)
Describe the benefits of multi-buffer streams
Learn the basics of PSP drivers
Module Topics
Stream I/O and Drivers (PSP/IOM)
  Module Topics
  Driver I/O - Intro
  Using Double Buffers
  PSP/IOM Drivers
  Additional Information
  Notes
//prolog - prime the process
status = SIO_issue(&sioIn, pIn1, SIZE, NULL);
status = SIO_issue(&sioIn, pIn2, SIZE, NULL);
status = SIO_issue(&sioOut, pOut1, SIZE, NULL);
status = SIO_issue(&sioOut, pOut2, SIZE, NULL);

//while loop - iterate the process
while (condition == TRUE) {
    size = SIO_reclaim(&sioIn, (Ptr *)&pInX, NULL);
    size = SIO_reclaim(&sioOut, (Ptr *)&pOutX, NULL);
    // DSP... to pOut
    status = SIO_issue(&sioIn, pInX, SIZE, NULL);
    status = SIO_issue(&sioOut, pOutX, SIZE, NULL);
}

//epilog - wind down the process
status = SIO_flush(&sioIn);    //stop input
status = SIO_idle(&sioOut);    //idle output, then stop
size = SIO_reclaim(&sioIn, (Ptr *)&pIn1, NULL);
size = SIO_reclaim(&sioIn, (Ptr *)&pIn2, NULL);
size = SIO_reclaim(&sioOut, (Ptr *)&pOut1, NULL);
size = SIO_reclaim(&sioOut, (Ptr *)&pOut2, NULL);
PSP/IOM Drivers
Additional Information
Notes
C66x Introduction
Introduction
This chapter provides a high-level overview of the architecture of the C66x devices along with a brief overview of the MCSDK (Multicore Software Development Kit).
Objectives
Describe the basic architecture of the C66x family of devices
Provide an overview of each device subsystem
Module Topics
C66x Introduction
  Module Topics
  C66x Family Overview
    C6000 Roadmap
    C667x Architecture Overview
  C665x Low-Power Devices
  MCSDK Overview
    What is the MCSDK?
    Software Architecture
    For More Info
  Notes
    More Notes
C6000 Roadmap
[Figure: roadmap chart with a "Floating-point value" track and a "Fixed-point value" track.]
C67x: IEEE 754 native instructions for SP & DP; advanced VLIW architecture
C67x+: 2x registers; enhanced floating-point add capabilities
C64x: Advanced fixed-point instructions; four 16-bit or eight 8-bit MACs; two-level cache
C64x+: SPLOOP and 16-bit instructions for smaller code size; flexible level-one memory architecture; iDMA for rapid data transfers between local memories
C674x: 100% upward object code compatible with C64x, C64x+, C67x and C67x+; best of fixed-point and floating-point architecture for better system performance and faster time-to-market
CorePac
1 to 8 C66x CorePac DSP cores operating at up to 1.25 GHz
  Fixed/floating-point operations
  Code compatible with other C64x+ and C67x+ devices
L1 memory, partitioned as cache or RAM
  32 KB L1P/D per core
Dedicated L2 memory, partitioned as cache or RAM
  512 KB to 1 MB per core
Direct connection to memory subsystem
Memory Subsystem
Multicore Shared Memory (MSM SRAM)
  2 to 4 MB (program or data)
  Available to all cores
Multicore Shared Memory Controller (MSMC)
  Arbitrates access to shared memory and DDR3 EMIF
  Provides CorePac access to coprocessors and I/O
  Provides address extension to 64 GB (36 bits)
DDR3 External Memory Interface (EMIF)
  8 GB
  Support for 16/32/64-bit modes
  Specified at up to 1600 MT/s
Multicore Navigator
Provides seamless inter-core communications (messages and data) between cores, IP, and peripherals ("fire and forget")
Low-overhead processing and routing of packet traffic to/from cores and I/O
Supports dynamic load optimization
Consists of a Queue Manager Subsystem (QMSS) and multiple, dedicated Packet DMA engines
[Figure: block diagram of the QMSS (Queue Manager, Config RAM, register I/F, internal Link RAM, timer, accumulator APDSP, interrupt distributor, queue interrupts, queue pend) and a PKTDMA hardware block (Rx and Tx cores, Rx coherency unit, Rx/Tx channel control and FIFOs, Tx DMA scheduler, Rx/Tx streaming I/F, Tx scheduling I/F for AIF2 only) on the VBUS.]
Network Coprocessor
Provides H/W accelerators to perform L2, L3, L4 processing and encryption (often done in S/W)
Packet Accelerator (PA)
  8K multi-in/out HW queues
  Single IP address option
  UDP/TCP checksum and CRCs
  Quality of Service (QoS) support
  Multi-cast to multiple queues
Security Accelerator (SA)
  HW encryption, decryption and authentication
  Supports protocols: IPsec ESP, IPsec AH, SRTP, 3GPP
External Interfaces
2x SGMII ports support 10/100/1000 Ethernet
4x SRIO lanes for inter-DSP transfers
SPI for boot operations
UART for development/test
2x PCIe at 5 Gbps
I2C for EPROM at 400 Kbps
GPIO
Application-specific interfaces
TeraNet Switch Fabric
Non-blocking switch fabric that enables fast and contention-free data movement
Can configure/manage traffic queues and priorities of transfers while minimizing core involvement
High-bandwidth transfers between cores, subsystems, peripherals and memory
Facilitates high-bandwidth communication links between DSP cores, subsystems, peripherals, and memories
Supports parallel orthogonal communication links
[Figure: 256-bit TeraNet (CPUCLK/2) connecting masters (M) and slaves (S), including HyperLink, MSMC with shared L2, DDR3, XMC, SRIO, Network Coprocessor, EDMA transfer controllers (TPCC, TC2-TC9, QDMA), TAC_FE, RAC_BE0,1, FFTC/PktDMA, AIF/PktDMA, QMSS, PCIe, VCP2 (x4) and DebugSS.]
Diagnostic Enhancements
Embedded Trace Buffers (ETB) enhance CorePac's diagnostic capabilities
CP Monitor provides diagnostics on TeraNet data traffic
  Automatic statistics collection and exporting (non-intrusive)
  Can monitor individual events
  Monitor all memory transactions
  Configure triggers to determine when data is collected
HyperLink Bus
Expands the TeraNet Bus to external devices
Supports 4 lanes with up to 12.5 Gbaud per lane
Miscellaneous Elements
Boot ROM
HW Semaphore provides atomic access to shared resources
Power Management
PLL1 (CorePacs), PLL2 (DDR3), PLL3 (Packet Acceleration)
Three EDMA Controllers
16 64-bit Timers
Inter-Processor Communication (IPC) Registers
Wireless Applications (C6670)
Wireless-specific coprocessors
  2x FFT Coprocessor (FFTC)
  Turbo Dec/Enc (TCP3D/3E)
  4x Viterbi Coprocessor (VCP2)
  Bit-rate Coprocessor (BCP)
  2x Rake Search Accel (RSA)
Wireless-specific interfaces
  6x Antenna Interface (AIF2)
Application-Specific I/O
2x Telecom Serial Port (TSIP)
EMIF 16 (EMIF-A)
  Connects memory up to 256 MB
  Three modes: Synchronized SRAM, NAND Flash, NOR Flash
C6655/57
1.25 GHz
Memory Subsystem
  1 MB local L2 per core
  MSMC, 32-bit DDR3 I/F
Hardware Coprocessors
  TCP3d, VCP2
Multicore Navigator
Interfaces
  2x McBSP, SPI, I2C, UPP, UART
  1x 10/100/1000 SGMII port
  HyperLink, 4x SRIO, 2x PCIe
  EMIF 16, GPIO
Debug and Trace (ETB/STB)
[Figure: C6655/57 block diagram. Each C66x CorePac has a 32 KB L1 P-cache, a 32 KB L1 D-cache and a 1024 KB L2 cache; other blocks include Debug & Trace, Boot ROM, Semaphore, Timers, Security/Key Manager, Power Management, PLL, EDMA, Queue Manager and Packet DMA.]
C6654
Memory Subsystem
  1 MB local L2
Multicore Navigator
Interfaces
  2x McBSP, SPI, I2C, UPP, UART
  1x 10/100/1000 SGMII port
  EMIF 16, GPIO
Debug and Trace (ETB/STB)
[Figure: C6654 block diagram. The C66x CorePac has a 32 KB L1 P-cache, a 32 KB L1 D-cache and a 1024 KB L2 cache; other blocks include Debug & Trace, Boot ROM, Semaphore, Timers, Security/Key Manager, Power Management, PLL, EDMA, Queue Manager, Packet DMA and PCIe x2.]
C6654: 0.85 No 1066 No No No No
C6655: 1 @ 1.0, 1.25
C6657: 2 @ 0.85, 1.0, 1.25
MCSDK Overview
The Multicore Software Development Kit (MCSDK) provides the core foundational building blocks for customers to quickly start developing embedded applications on TI high performance multicore DSPs.
What is MCSDK?
Uses the SYS/BIOS or Linux real-time operating system
Accelerates customer time to market by focusing on ease of use and performance
Provides multicore programming methodologies
Available for free on the TI website, bundled in one installer; all the software in the MCSDK is in source form along with pre-built libraries
[Figure: host computer, target board, XDS 560 V2 and XDS 560 Trace emulators.]
Software Architecture
Migrating Development Platform
[Figure: software stack (demo application, Tools (UIA), LLD, IPC, CSL) shown for the TI demo application on the TI evaluation platform and then on the customer platform. Legend: no modifications required; may be used as is or customer can implement value-add modifications; needs to be modified or replaced with customer version.]
Software may be different, but the APIs remain the same (CSL, LLD, etc.)
BIOS-MCSDK Software
Demonstration Applications: HUA/OOB, IO Bmarks, Image Processing
Software Framework Components: Interprocessor Communication, Instrumentation (MCSA)
Algorithm Libraries: DSPLIB, IMGLIB, MATHLIB
Communication Protocols: TCP/IP Networking (NDK)
Platform/EVM Software: Platform Library, Resource Manager, OSAL, Transports (IPC, NDK), POST, Bootloader
SYS/BIOS RTOS
Low-Level Drivers (LLDs): EDMA3, PCIe, PA, QMSS, SRIO, CPPI, FFTC, HyperLink, TSIP
[Figure: interprocessor communication between Process 1 and Process 2, shown for BIOS with IPC and for Linux with SysLink, with a table marking task-to-task, core-to-core and device-to-device support.]
DVD Contents
Download Software
User's Guide
Online Collateral
TMS320C667x processor website
  http://focus.ti.com/docs/prod/folders/print/tms320c6678.html
  http://focus.ti.com/docs/prod/folders/print/tms320c6670.html
MCSDK website for updates
  http://focus.ti.com/docs/toolsw/folders/print/bioslinuxmcsdk.html
CCS v5
  http://processors.wiki.ti.com/index.php/Category:Code_Composer_Studio_v5
Developer's website
  Linux: http://linux-c6x.org/
  BIOS: http://processors.wiki.ti.com/index.php/BIOS_MCSDK_2.0_User_Guide
Software Forums
For questions regarding topics covered in this training, visit the following e2e support forums:
  http://e2e.ti.com/support/embedded/f/355.aspx
  http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639.aspx
Notes
More Notes