102
102
單選題
1. Pipelining in CPU design is aimed to provide the optimal
(A) throughput (B) latency (C) space parallelism (D) caching of instruction execution.
Answer: (A)
2. Given a non-pipelined CPU operating at 50 MHz, we are optimizing the CPU with a 4-stage
pipeline design. In an ideal case, what is the operating clock frequency for the 4-stage pipeline
CPU? (A) 50 MHz (B) 100 MHz (C) 150 MHz (D) 200 MHz (E) 400 MHz
Answer: (D)
3. Direct Memory Access (DMA) is (A) worse (B) better (C) the same
in performance for small transfers than interrupt driven I/O.
Answer: (A)
4. Suppose a 32-bit CPU with physical address bus A31 A30 ..... A1 A0 and assume that the data cache
has the following structure:
Cache structure is set associative with 2 lots per set
Cache size is 128 KBytes
Cache block size is 16 Bytes and a word is 32-bit (4 Bytes)
Cache is indexed with physical address
The address bits (A) A16 A15 ...A4 (B) A15 A14 ...A4 (C) A14 A13 ...A4 (D) A13 A12 ...A4 (E) A15 A14 ...A3
are used as index address for the cache.
Answer: (B)
Remark: Number of sets = 128KB / (16B 2) = 4K
Answer: (A)
Remark: Number of sets = 128KB / (16B) = 8K
Answer: (B)
Remark: Execution time = [(5 – 1) + 10G + 2.2 10G] /1G = 32
7. What instruction in ARM processors does not affect the conditional code?
(A) ADDS r0, r1, r2 (B) ADD r0, r1, r2 (C) CMP r1, r2 (D) TST r1, r2
Answer: (B)
8. An application spends 80% of its time doing multiply instructions. If the multiplier is sped up by
4 times, the application will run (A) 5 (B) 4 (C) 3 (D) 2.5 (E) 2 times faster.
Answer: (D)
9. An application spends 80% of its time doing multiply instructions. If the multiplier is sped up by
infinite times, the application will run (A) 5 (B) 4 (C) 3 (D) 2.5 (E) 2 times faster.
Answer: (A)
複選題
10. (A) instruction pre- fetch (B) instruction buffer (C) branching (D) cache misses (B) resource
constraints will keep the pipeline from being full.
12. If we run the following program on a 32-bit machine, what outputs might be generated?
#include <stdio.h>
int main ( )
{
int A[3] = {1, 2, 3};
int *ptr;
ptr A;
printf(" %p : %d \n", ptr, *ptr) ;
ptr++;
printf(" %p : %d \n", ptr, *ptr) ;
return 0;
}
(A) 0xbfe5a870:1
0xbfe5a874:2
(B) 0xbfela5e0:l
0xbfela5e4:2
(C) 0xbfe5a870:2
0xbfe5a874:3
(D) 0xbfela5e0 :2
0xbfela5e4:3
(E) 0xbfe5a870:1
0xbfe5a874:3
Answer
(1) The average CPI of the original processor = 6 0.2 + 9 0.3 + 15 0.5 = 11.4
X X X
1× 11.4 2× 6− × 0.6+ 9− × 0.2+ 15− × 0.2
(2) > 100 50 20
X > 109.09
100 X
(3) The average CPI after adding cache for original processor = 11.4 – 2.5 + 0.5 = 9.4
X X X
1× 9.4 2× 6− × 0.6+ 9− × 0.2+ 15 − × 0.2
> 100 50 20
X > 125.37
100 X
Answer
(1)
Instruction Opcode rs rt rd shamt funct
addi $15, $0, -8 00000 01111 1111 1111 1111 1000
sltu $16, $15, $0 01111 00000 10000 00000
jalr $16, $15 10000 00000 01111 00000
(2) addi instruction datapath:
4 31-26 To control
32 bits unit ALUOP
Add 32 bits
25-21 (15)
RR1 RD1
20-16 (0)
P WR
ALU
C Instr. WD
Mem. 32 bits
SE
15-0 (-8) 32 bits
4 31-26 To main
32 bits control unit
Add ALUOP
RegWrite
25-21 (15) 32 bits
RR1 RD1
P 20-16 (0)
RR2
C Instr. 15-11 (16) ALU
WR RD2
Mem. 32 bits
WD 32 bits
5-0
To ALU control unit
jalr instruction datapath:
4
Add 31-26 To main
control unit
25-21 (16)
RR1 RD1
P 15-11 (15)
C Instr. WR
RD2
Mem.
WD
32 bits
32 bits
3. Suppose there exists a 12-bit IEEE 754 floating point format, with 1 sign bit, 6 exponent bits,
and 5 mantissa bits.
(1) How would - be represented in this 12-bit format? And what is the smallest positive
normalized number. Give the value in decimal of the second number, and show both
either as 12 bits or as 3 hexadecimal digits.
(2) Give the nearest representation n of 5.612 in this format.
(3) What is the actual value of n? Hence, work out its relative error r, to 3 significant digits.
You may use the fact that a / 5.612 ≈ a 0.1782.
(4) Calculate n2 using binary floating point multiplication. Show rounding, normalization
and where you might check for overflow. Give the result as a 12-bit IEEE 754 number.
Answer
(1)
Decimal value Binary format Hexadecimal format
Smallest positive - 1 111111 00000 FE0
normalized number 1.0 2 -30
0 000001 00000 020
(2) 5.61210 = 101.10011…. 2 = 1.0110011…. 22
0
0 100001 01101
(3) The actual value of n = 101.1012 = 5.625
r = (5.625 − 5.612) / 5.612 ≈ 0.013 × 0.1782 ≈ 2.32 × 10−3
(4)
No normalization necessary; 6-bit excess-31 exponent can accommodate 4 = 1000112 , so no
overflow. Result: 0 100011 111112 = 11111.12 = 31.510
4. The simple datapath with the control unit for the MIPS processor is shown below. The input
to the ALU control unit is the 6-bit opcode field from the instruction.
Add
4 Add Sum
Shift PCSrc
left 2
Instruction [31-26]
Control
Instruction [5-0]
Answer
(1)
Memto- Reg- Mem- Mem-
Instruction RegDst ALUSrc Branch
Reg Write Read Write
add 1 0 0 1 0 0 0
lw 0 1 1 1 1 0 0
sw 1 0 0 1 0
beq 0 0 0 0 1
(2) If the instruction beq is taken, the new PC address = 52 + 8 4 = 84
If the instruction beq is not taken, the new PC address = 52
(3) The clock cycle time is 2220 ps.
Instruction Latency
add 500 + 220 + 100 + 180 + 100 + 220 = 1300 (ps)
beq 500 + 220 + 100 + 180 + 100 =1100 (ps)
lw 500 + 220 + 180 + 1000 + 100 + 220 = 2220 (ps)
sw 500 + 220 + 180 + 1000 =1900 (ps)
(4) 2220 – 500 – 1000 = 720 (ps)
(5) The RAW data dependency between instruction pairs (add, sw) and (add, beq) will affect the
correctness of the given instructions execution.
(6) Delayed branch is a type of branch where the instruction immediately following the branch is
always executed, independent of whether the branch condition is true or false.
The following shows the code after “from before” scheduling.
lw $1, 50($7)
add $4, $5, $6
beq $1, $4, 8
sw $4, 50($7)
5. In the original processor shown in problem 4, some actions will be taken when an exception occurs.
Please pickup the right things from the following table and give correct order about those actions.
(ex: (a) (b) (c))
(a) stopping the execution of the program (b) automatically execute a jal instruction
(c) save all register values into the stack (d) execute the predefined actions for exceptions
(e) flushing all instructions (f) transfer the control to OS at some specified address
(g) automatically stall one cycle (h) save the address of the offending instruction in EPC
6. In this problem, we assume that the processor in problem 4 has been pipelined into 5 stages: IF,
ID, EX, MEM, and WB with necessary pipeline registers.
(1) For the datapath shown in problem4, please modify the datapath to implement the forwarding
capability that can eliminate the stalls due the data dependency in the code sequence given
in problem 4. In order to simplify the answers, only the datapath of EX and MEM stages
are required to be redrawn on your answer sheet.
(2) Please explain the meanings of the added signals in (1) with the forwarding capability. The
triggering conditions of the added control signals are also required.
(3) With the forwarding capability in (1), please give the values of the control
signals (RegDst, ALUSrc, MemtoReg, RegWrite, Branch) at the fifth cycle while
executing the given code sequence in problem 4.
Answer
(1)
ID/EX
EX/MEM
MEM/WB
RegWrite
Data
ALU memory
FB FA
Rs
Rt
Rt EX/MEM.RegisterRd
Rd
Forwarding
unit
(2)
Input Bits Usage
ID/EX.RegisterRs 5 operand reg number, compare to see if match
ID/EX.RegisterRt 5 operand reg number, compare to see if match
EX/MEM.RegisterRd 5 destination reg number, compare to see if match
EX/MEM.RegWrite 1 TRUE if writes to the destination reg
Output Bits Usage
FA 2 forwarding signal
FB 2 forwarding signal
(3)
Control signal RegDst ALUSrc MemtoReg RegWrite Branch
Value X 1 1 1 0
7. State whether the following techniques are associated primarily with a software- or
hardware-based approach to exploiting ILP (Instruction-Level Parallelism). If it is associated
with a software-based approach, please give an 'S' as the answer. If it is associated with a
hardware-based approach, please give an 'H' as the answer. In some cases, the answer may be
both.
(1) Branch prediction
(2) Register renaming
(3) Speculation
(4) Superscalar
(5) VLIW
Answer
(1) (2) (3) (4) (5)
B B B H S
8. Memory Hierarchy:
(1) What is the advantage of separating instruction cache and data cache when comparing to
the unified cache?
(2) Please state why it is "virtual" in the so-called virtual memory?
(3) A computer system has 1G bytes main memory and 512K bytes cache with 2-way set
associativity and 4-byte block size. The partial content of the cache is listed as follows at
a specific time instance. Please identify the possible two lowest addresses of the data that
is shaded (i.e. D2).
(Note: 1... 1 denotes consecutive 1 s and 0.. .0 denotes consecutive 0s)
Index Tag Data(11) Data(10) Data(01) Data(00)
100…0 81 82 83 84
0
011…1 91 92 93 94
000…0 A1 A2 A3 A4
1
011…1 B1 B2 B3 B4
111…1 C1 C2 C3 C4
2
100…0 D1 D2 D3 D4
100…0 E1 E2 E3 E4
3
000…0 F1 F2 F3 F4
Answer
(1) Split cache compares with unified cache has the advantage of doubling the cache bandwidth.
(2) In virtual memory system, all physical memory is controlled by the operating system. Some
memory can be stored in physical RAM chips while other memory is stored on a hard drive.
When a program needs memory, it requests it from the operating system. Computer
programmers no longer need to worry about where the memory is physically stored or whether
the user's computer will have enough memory. The word “virtual” is used to represent the fact
that some memory is actually stored on a hard drive.
(3)
Address format Tag Index offset
Field length 12 bits 16 bits 2 bits
Binary value 100000000000 0000000000000010 10
102 交大資聯
單選題
1. Which one of the following statements is TRUE?
(a) From the point of view of a network service provider, latency is a suitable metric to
measure the performance of the network service server.
(b) Mini-computer has disappeared from the computer market since main- frame computers
have been well designed to have much more computing power than mini-computers.
(c) The CPU operating clock rate has not been continuously and significantly improved these
years because Moor's law has been invalidate since several years ago.
(d) CPU and RF/analog designs are generally regarded as two main streams/representatives
that can represent the semiconductor technology level of a nation.
(e) With the fixed number of defects per area, a small die-area IC design has better
manufacture yield than the big die-area one.
Answer: (e)
Answer: (e)
Remark(a): Same ISA means same IC and same circuit design means same CPI.
3. Given the values of registers $t0 and $t1 in the table below, which of following statements is
correct.
$t0 = 0xAAAAAAAA
$t1 = 0x22222222
(a) For the initial values of registers in the table above, the value of$t2 will become
0x55555552 after the execution of code sequence below.
sll $t2, $t0, 3
sr $t2, $t2, $t1
(b) For the initial values of registers in the table above, the value of$t2 will become
0x37777777 after the execution of code sequence below.
srl $t2, $t0, 3
or $t2, $t2, $t1
(c) For the initial values of registers in the table above, the value of $t2 will become
0xF7777777 after the execution of code sequence below.
srl $t2, $t0, 3
or $t2, $t2, $t1
(d) For the initial values of registers in the table above, the value of$t2 will become 0 after
the execution of instruction below.
slt $t2, $t0, $t1
(e) For the initial values of registers in the table above, the value of$t2 will become 1 after
the execution of instruction below.
sltu $t2, $t0, $t1
Answer: (b)
4. Suppose the program counter (PC) is at address 0x80000000, which of following statements
is incorrect.
(a) It is possible to use one branch instruction to get to address 0x7fffff00
(b) It is possible to use one jump instruction to get to address 0x7fffff00
(c) It is possible to use one branch instruction to get to address 0x80000040
(d) It is possible to use one branch instruction to get to address 0x80010000
(e) It is possible to use one jump instruction to get to address 0x8fff0000
Answer: (b)
Answer: (c)
Remark(c): 5 times
複選題
6. Each bit’s carry out of a carry lookahead adder (CLA) is a function of each bit’s generate,
propagate, and carry in signals. Consider a two- level 9-bit CLA that is composed of several
3-bit CLA units with c0 as the carry in and c1 to c9 as the carry outs. Each CLA unit realizes
3-bit group generate and group propagate functions based on 3-bit inputs, A signal is regarded
validate if its value is correct; or, it is regarded as invalidate. Signals gi and pi are the generate
and propagate signals at bit i while G i-j and Pi-j are the group generate and group propagate
signals associated with bits i to j. Assume all input signals have been available, which ones of
the following statements are correct ?
(a) A number of 4 CLA units in total are required to assemble the two- level 9-bit CLA.
(b) Carry outs c1 , c2 and c3 become validate earlier than other carry out signals.
(c) G0-2 = g2 + p2 g1 + p2 p1 g0 and P0-2 = p2 p1 p0 .
(d) Carry outs c4 and c5 become validate earlier than c6 .
(e) Assume the delays of each NOT, isolated AND, XOR, AND-OR structure gates are 1, 2, 2,
and 3 time units respectively, and the sum is realized by 2 XOR gates while the propagate
is realized by a XOR gate. Then the longest delay of the 2- level 9-bit CLA is the 12 time
units.
m3
m2 m3
m1
m4
Remark(d):
Word Block
Tag Index Hit/Miss 3C
address address
0 0 0 0 Miss Compulsory
40 10 1 2 Miss Compulsory
80 20 2 4 Miss Compulsory
340 85 10 5 Miss Compulsory
56 14 1 6 Miss Compulsory
68 17 2 1 Miss Compulsory
172 43 5 3 Miss Compulsory
348 87 10 7 Miss Compulsory
48 12 1 4 Miss Compulsory
80 20 2 4 Miss Conflict
Remark(e):
Word Block
Tag Index Hit/Miss 3C
address address
0 0 0 0 Miss Compulsory
40 10 1 1 Miss Compulsory
80 20 2 2 Miss Compulsory
340 85 9 4 Miss Compulsory
56 14 1 5 Miss Compulsory
68 17 1 8 Miss Compulsory
172 43 4 7 Miss Compulsory
348 87 9 6 Miss Compulsory
48 12 1 3 Miss Compulsory
80 20 2 2 Hit
9. A system has a 16-bit virtual address size with 256B pages and a 16-bit byte addressing
physical memory with physically index 2-way set associative data cache (LRU replacement
policy) of512B cache size and 16B blocks. The first six accesses to the page table are: 0x1332,
0xad58, 0x0162, 0x2ea9, 0x813d, 0x9f5a. The system uses a single level page table. Assume
the TLB and data cache are initially empty and their contents have been updated accordingly
by the first six accesses (page table is shown below; VPN: virtual page number, PPN: physical
page number). The CPU includes a fully associative TLB with 2 entries and an LRU
replacement policy. Data reads from the following virtual addresses are performed (in the
order listed): 0x813f, 0xl36f, 0x2e5f, 0x9f50, 0x1370. Which ones of following statements are
correct?
VPN PPN VPN PPN
0x01 0x27 0x81 0x8a
0x13 0x45 0x9f 0xcd
0x2e 0xe8 0xad 0x7e
(a) The required numbers of bits for virtual page number and page offset in virtual address
are all 8 bits.
(b) The required number of bits for index and tag in physical address are 4 and 8 bits
respectively.
(c) Data access to 0x813fhas access hit in TLB and cache.
(d) Data access to 0X9f50 has access miss in TLB and access hit in cache.
(e) Data access to 0X1370 has access miss in TLB and cache.
Remark:
Virtual address Physical address
TLB Cache
VPN PO tag index offset
(a), (b) 8bits 8bits 8bits 4bits 4bits H/M H/M
13 32 45 3 2 M M
ad 58 7e 5 8 M M
01 62 27 6 2 M M
2e a9 e8 a 9 M M
81 3d 8a 3 d M M
9f 5a cd 5 a M M
(c) 81 3f 8a 3 f H H
13 6f 45 6 f M M
2e 5f e8 5 f M M
(d) 9f 50 cd 5 0 M H
(e) 13 70 45 7 0 M M
10. For the Single-Cycle MIPS CPU shown below, which of following statements are correct?
PCSrc
Add
4 Add Sum
RegWrite Shift
left 2
Instruction [25-21]
Read Read MemWrite
PC
address Instruction [20-16] register 2 Read
Instruction Read data 1 ALUSrc MemtoReg
[31-0] register 1 ALU Address
Read
Instruction Write data 2 Read
memory Register data
Write
Instruction [15-11]
data Registers A Write
data Data
RegDst
memory
Instruction [15-0] 16 Sign 32 B
extend
MemRead
(a) If the path labeled A has been cut, the instructions add, slt, and sw still can run correctly.
(b) If the path labeled B has been cut, the instructions lw, sw and beq may fail.
(c) If the control signal ALUsrc is stuck on 0, the instructions lw, sw, and beq may fail to run
correctly.
(d) If the control signal RegDst is stuck on 0, the instructions lw and sw still can run
correctly.
(e) If the control signal MemToReg is stuck on 1, the instructions add, sub, lw, and slt may
fail to run correctly.
IF/IDWrite
ID/EX
EX/MEM
Control MEM/WB
PCWrite
IF/ID 0
Instruction
Registers
Instruction ALU
PC Data
memory
memory
A
IF/ID.RegisterRs
IF/ID.RegisterRt
IF/ID.RegisterRt EX/MEM.
Rt RegisterRd
IF/ID.RegisterRd Rd
ID/EX.RegisterRt
Rt
B
Forwarding MEM/WB.RegisterRd
Rs
unit
(a) Assume the path labeled A has been cut, the following code snippet will fail.
add $s2, $s0, $s1
add $s3, $s0, $s2
(b) Assume the path labeled A has been cut, the following code snippet will fail.
add $t1, $t1, $s2
add $t1, $t0, $t2
add $s2, $s1, $t1
(c) Assume the path labeled B has been cut, the following code snippet will fail.
add $t1, $t1, $s2
add $t1, $t0, $t2
add $s2, $s1, $t1
(d) Assume the path labeled A has been cut, the following code snippet will fail.
lw $s0, 4($t1)
add $s2, $s0, $s1
(e) Assume the path labeled B has been cut, the following code snippet will fail.
lw $s0, 4($t1)
add $s2, $s0, $s1
Answer: (d)
Remark(a): (5 – 1) + 4 + 1 + 1 = 10
Remark(d): (5 – 1) + 4 + 1 + 2 = 11
102清大資工
1. Given the table of latencies of individual stages of a MIPS instruction, please answer the
following questions:
Instruction fetch Register read ALU operation Data Access Register write
200ps 100ps 250ps 400ps 100ps
(a) What is the clock cycle time in a nonpipelined and pipelined processor?
(b) What is the total latency of a lw instruction in a nonpipelined and pipelined processor?
(c) If the time for an ALU operation can be shortened/increased by 25%, how much
speedup/slowdown will it be from pipelining? Explain the reason.
(d) If you can split one stage into two, each with half the origina l latency, to improve the
pipelining performance. Which stage would you split and what is the new clock cycle
time in a pipelined processor?
Answer
(a) The clock cycle time in a nonpipelined processor is 1050ps
The clock cycle time in a pipelined processor is 400ps
(b) The total latency of a lw instruction in a nonpipelined processor is 1050ps
The total latency of a lw instruction in a pipelined processor is 2000ps
(c) Shortening/Increasing the ALU operation will not affect the speedup /slowdown obtained
from pipelining. It would not affect the clock cycle.
(d) Data Access stage
New clock cycle time = 250ps
2. Please answer the following questions about interfacing I/O devices to the processor and
memory:
(a) Describe the following terminologies:
• Memory- mapped I/O.
• I/O instruction.
• Device polling.
• Interrupt-driven communication.
(b) Which communication pattern is most appropriate for a "Video Game Controller"?
Explain.
(c) Prioritize the following interrupts caused by different devices:
• Mouse Controller.
• Reboot.
• Overheat.
Answer
(a)
Memory-mapped I/O: An I/O scheme in which portions of address space are assigned to
I/O devices and reads and writes to those addresses are interpreted as commands to the
I/O device.
I/O instruction: A dedicated instruction that is used to give a command to an I/O device and
that specifies both the device number and the command word (or the location of the
command word in memory).
Device polling: The process of periodically checking the status of an I/O device to determine
the need to service the device.
Inte rrupt-driven communication: An I/O scheme that employs interrupts to indicate to the
processor that an I/O device needs attention.
(b) Device polling
(c) Mouse Controller 3
Reboot 2
Overheat 1
3. In general, a fully associative cache is better than a direct-mapped cache in term of miss
rate. However, it is not always the case for a cache, especially for a small cache. Please
design an example to demonstrate that a direct- mapped cache outperforms a fully
associative cache in term of miss rate under the replacement policy. Least Recently Used
(LRU).
Answer: Suppose the cache has 2 blocks and CPU has the follow references of block address
1, 2, 3, 4, 5, 1, 2, 5, 1, 2, 3, 4, 5
Answer
IC old IC old
(a) = = 1.01523
IC new IC old × 1−0.05× 0.3
Machine A before adding the new addressing mode is faster by 1.01455 times.
Answer
(a)
A’
B
C’
D’
A
B’
C
O1
F
G
H
E’
0
(b) Since there are total 8 Boolean variables included in the 3 functions, the length of the ROM
address is 8 bits. The ROM size would be 28 3 bits = 768 bits
6. Assume the floating point is stored in a 32 bits wide format. The leftmost bit is the sign bit,
the exponent of 16 in excess-64 format is 7 bits wide, and the mantissa is 24 bits long and a
hidden 1 is assumed.
(a) What is the smallest positive number representable in this format (write the answer in
hex and decimal)?
(b) What is the largest positive number representable (write the answer in hex and
decimal)?
(c) Assume A = 2.34375 l0-2 and B = 2.9015625 102 . Write down the bit pattern using
this format for A and B.
(d) Using the format presented in (c), calculate the A – B.
Answer
Answer
Automatic garbage collection by hardware: The execution time of the new machine is (1 − 0.3) ×
1.4 = 0.98 times that of the original.
Special garbage collection instructions: The execution time of the new machine is (1 – 2 0.3/3)
× 1.2 = 0.96 times that of the original.
Therefore, the second option is the best choice.
102成大電機
Choose the correct answers for the following multiple choice problems. Each question may have more
than one answer, 10 points each, no partial point, no penalty.
1. Which of the following is (are) true for a 128KB cache with a line size of 64 bytes? Assume that
the cacheable memory is 4 GB. Address bits are numbered from A0 to A31
(a) In a direct- mapped implementation, the index field uses address bit A6 to A16 for cache line
selection.
(b) In a direct- mapped implementation, the tag length is 13 bits.
(c) In a direct- mapped implementation, the tag length is 15 bits; the field determining the line
size is 6 bits in length.
(d) In a direct- mapped implementation, the total tag size is 30720 bits.
Answer: (c)
Answer: (b)
4. Which of the following is (are) true for virtual memory system?
(a) Virtual memory function can be enabled through software control.
(b) Virtual memory technique treats part of the main memory as a fully-set associative
write-back cache for program execution.
(c) A translation lookaside buffer can be seen as the cache of a page table.
(d) A page table is shared among the programs in execution.
Answer: (d)
Remark: When the PC powers on, the processor's registers are set to some predefined values. One of
the registers is the instruction pointer (or program counter) register and its value after a power on
is well defined: it is a 32-bit value of 0xfffffff0. The instruction pointer register points to code to
be executed by the processor. The computer's hardware translates this address so that it points to a
BIOS memory block.
BIOS is a chip on the motherboard that has a relatively small amount of read-only memory (ROM).
This memory contains various low- level routines that are specific to the hardware supplied with
the motherboard. So, the processor will first jump to the address 0xfffffff0, which really resides in
the BIOS's memory. Usually this address contains a jump instruction to the BIOS's POST routines.
POST stands for Power On Self Test. This is a set of routines including the memory check, system
bus check and other low-level stuff so that the CPU can initialize the computer properly. The
important step on this stage is determining the boot device. All modern BIOS's allow the boot
device to be set manually, so you can boot from a floppy, CD-ROM, harddisk etc.
Remark(c): The BIOS contains the first instruction being fetched.
6. Which of the following is (are) true about instruction set architecture (ISA)?
(a) ISA is an abstraction which is the interface between the hardware and the low- level software
(assembly instructions).
(b) ISA enables the different implementations of a processor using the same ISA to run identical
software.
(c) ISA specifies how a processor pipeline should be implemented.
(d) MIPS and ARM are both RISC-type ISA and use the same instructions set architecture.
Answer: (b)
Remark: ISA is an abstraction which is the interface between the hardware and the low- level software
(Machine language).
7. Which of the following is (are) true about page fault and TLB exception?
(a) A page fault is signaled by operating system so that the OS can fetch the missing page.
(b) A TLB exception is handled by the DMA controller in a hard drive.
(c) Both a page fault and a TLB miss are exceptions and signaled by the processor hardware.
(d) When a requested page is not found in the main memory, this causes a page fault.
Remark: A page fault is a trap to the software raised by the hardware. The hardware that detects a
page fault is the memory management unit in a processor. The exception handling software that
handles the page fault is generally part of the operating system.
9. Which of the following is (are) true about the label specified in a branch instruction, for instance,
beq r1, r2, label?
(a) The label is a pseudo- instruction and the compiler compiles it into a memory location.
(b) The label is transformed into an offset specifying the difference between the branch
instruction (or the instruction after the branch) and the branch target.
(c) The label is transformed into an absolute address in memory for the branch target.
(d) If the label is an external reference, the linker can figure out its value and finalize the binary
format for the branch instruction.
Answer: (a)
102成大電通
1. Read the following paragraph and choose the correct answers from the following multiple
choice questions. Each question may have more than one answer. No partial point, no penalty.
“Just as CPU programmers were forced to explicitly manage CPU memories in the days before
virtual memory, for almost a decade, GPU programmers directly and explicitly managed the
GPU memory hierarchy. The recent release of NVIDIA's Fermi architecture, however, has
brought GPUs to an inflection point: it implements a unified address space that eliminates the
need for explicit memory movement to and from GPU memory structures. Yet, without demand
paging, something taken for granted in the CPU space, programmers must still explicitly reason
about available memory. The drawbacks of exposing physical memory size to programmers are
well known. Which of the followings are true?”
(a) Without virtual memory technology, CPU programmers must explicitly manage the
physical memory.
(b) There is no need for the programmers of the Fermi architecture to wo rry about the available
memory to use.
(c) With virtual memory, a CPU programmer implements demand paging policy in his/her
program.
(d) GPU stands for General Processing Unit.
Answer: (a)
Answer
(a) The even when an accessed page in not present in main memory.
(b) The even when there is no matching entry in the TLB for a page.
(c) Each active process has its own page table.
3. For CPUs, the problem of exception support was solved at a relatively early stage. This support
was a key enabler to their success, and instrumental in this success was the definition of precise
exception handling. In a pipelined processor, multiple exceptions may occur at the same clock
cycle. Assume this is a five-stage pipeline. Write down which exception (if any) may occur at
the IF, ID, EXE, MEM, WB stage.
Answer
Stage Exception
IF (1) Instruction TLB miss (2) Hardware malfunction
ID (1) Undefined Instruction (2) Hardware malfunction
EXE (1) Arithmetic Overflow (2) Hardware malfunction
MEM (1) Memory protection error (2) Data TLB miss (3) Hardware malfunction
WB (1) Hardware malfunction
102 成大資工
1. Please describe the steps involved in developing and executing assembly language programs?
Answer
Step1: Write an assembly language program.
Step2: The assembler translates assembly language statements to their binary equivalents (object
code).
Step3: Separately assembled modules are combined into one single load module, by the linker.
Step4: The program loader copies the program into the computer's main memory, and at execution
time, program execution begins.
2. Please write the MIPS code to compute the mathematic formula a × b + 3c – 10.
Answer:
Assume that the values of variables a, b, and c are contained in registers $s0, $s1, and $s2,
respectively, and the computation result will be included in register $t0 and is smaller than 32 bit
binary number can represent.
mul $t0, $s0, $s1
sll $t1, $s2, 1
add $t1, $t1, $s2
add $t0, $t0, $t1
addi $t0, $t0, -10
3. (a) If processor A has a higher clock rate than processor B, and processor A also has a higher
MIPS rating than processor B, explain whether processor A will always execute faster than
processor B.
(b) Computer A has an overall CPI of 1.5 and can be run at a clock rate of 700MHz. Computer
B has a CPI of 2.0 and can be run at a clock rate of 650MHz. We have a particular
program to run and this program has exactly 120,000 instructions when compiled for
computer A. How many instructions would the program need to have when compiled for
Computer B if we want the two computers to have exactly the same execution time for this
program?
Answer
(a) We cannot differentiate which machine is faster from the measure of MIPS before the
capabilities of the ISA of these two machines are given.
(b) Suppose ICB represents instruction count for Computer B
Execution time for Computer A= Execution time for Computer B (120000 1.5) / 700M =
(ICB 2) / 650M ICB = 83571.429
4. (a) What are the two characteristics of program memory accesses that caches exploit?
(b) What are three types of cache misses?
Answer
(a) Temporal locality: if an item is referenced, it will tend to be referenced again soon.
Spatial locality: if an item is referenced, items whose addresses are close by will tend to be
referenced soon.
(b) Compulsory misses: a cache miss caused by the first access to a block that has never been in
the cache.
Capacity misses: a cache miss that occurs because the cache even with fully associativity, can
not contain all the block needed to satisfy the request.
Conflict misses: a cache miss that occurs in a set-associative or direct- mapped cache when
multiple blocks compete for the same set.
Answer
2. About the performance analysis, which of the following statement(s) should be true?
(a) Assume that a C program is compiled into 1000 machine instructions, which are related to
the size of the executable file. Then, the average execution time is usually equal to
multiplying 1000 (instructions) by its average CPI and the clock cyc le time.
(b) In evaluating the performance by using the benchmark tests, we usually use the geometric
mean to calculate the average value among various test results.
(c) MIPS is not a reliable metric since it provides the wrong results when we compare the
performances of a compiled program running on two machines with the same instruction set
architecture.
(d) It is usually a preferred approach to consider only one of the three factors: clock rate, CPI or
the instruction count, and then try to improve it to decrease the execution time. That is
called a divide-and-conquer methodology.
(e) None of the above is true.
Answer: (e)
3. About the memory hierarchy, which of the following statement(s) should be true?
(a) A page fault happens if a page that is not in the physical memory is accessed.
(b) TLB is a data cache of main memory.
(c) Adding a secondary cache can help to reduce the miss rate.
(d) We can always increase the size of cache to improve the cache performance.
(e) None of the above is true.
Answer: (a)
4. What are the advantages of register-register ALU instruction type compared with
register- memory ALU instruction type and memory- memory ALU instruction type?
(a) Data can be accessed without loading first.
(b) Fixed- length
(c) Simple code-generation
(d) Instructions take similar Clock Cycle per Instruction (CPI) to execute.
(e) Large variation in instruction size
5. What are the properties critical to program correctness that are normally preserved by control
dependency?
(a) Temporal locality
(b) Spatial locality
(c) Exception behavior
(d) Dataflow
(e) None of the above
單選題
7. Assume that a computer uses 32-bit addressing and has a 1MB cache with the block size equal to
32 bytes. The size of the Tag field in the address for a direct mapped cache is L. The size of the
Tag field in the address for an 8 way set associative cache is M. The size of the Tag field in the
address for a fully associative cache is N. L + M + N = K and
(a) K < 50
(b) 50 ≤ K < 55
(c) 55 ≤ K < 60
(d) 60 ≤ K < 65
(e) 65 ≤ K
Answer: (b)
Remark: L = 12; M = 15; N = 27
8. A cache with 64 blocks and a block size of 32 bytes. The byte address 2473537 maps to the
block number K (0 ≤ K < 64). K mod 5 equals to?
(a) 0
(b) 1
(c) 2
(d) 3
(e) 4
Answer: (a)
Remark: K = 50
9. The following table shows the information about a machine executing instructions. There are 5
classes of instructions, each of which takes 3-5 steps.
Register Memory Register
Instruction Class Fetch ALU
Read access Write
Load Word (LW) 2ns Ins 2ns 3ns Ins
Store Word (SW) 2ns Ins 2ns 3ns
R-Format (R) 2ns Ins 2ns Ins
Branch (B) 2ns Ins 2ns
Jump (J) 2ns Ins 2ns
For a certain program, 25% of the instructions are LW, 10% are SW, 45% are R, 15% are B, 5%
are J. Consider the following three cases: (i) If there is no pipelining, the average time required
for executing one instruction is L (ns). (ii) If a 5-stage pipeline is adopted, the average time
required for executing one instruction in the ideal case is M (ns). (iii) Assume that the
destinations of Branch (B) and Jump (J) will not be known until the end of the ALU stage. If
every B or J instruction incurs two pipeline bubbles, the average time required for executing one
instruction is N (ns). L + M + N = K. Then
(a) K < 13
(b) 13 ≤ K < 14
(c) 14 ≤ K < 15
(d) 15 ≤ K < 16
(e) 16 ≤ K.
Answer: (e)
Remark: L = 9 ns; M = 3 ns; N = 4.2 ns
10. Suppose that there is a machine with 32-bit addresses and a two- level page table. Note that the
first 10 bits of an address is an index into the first level page table and the next 10 bits are an
index into a second level page table. The page tables are stored in memory and each entry in the
page tables is 4 bytes. Suppose the space allocated to a specific process is 64 Mbytes. How many
pages are occupied by this process?
(a) 226 pages
(b) 214 pages
(c) 212 pages
(d) 210 pages
(e) None of the above
Answer: (b)
Remark: Page offset = 32 – 10 – 10 = 12 page size = 4KB number of pages = 226 / 212 = 214
11. (Continued from the previous question.) How much space is occupied in memory by the page
tables for the specific process?
(a) 68 kbytes
(b) 64 kbytes
(c) 16 kbytes
(d) 4 kbytes
(e) None of the above
Answer: (e)
Remark: The size of a small page table = 210 4B = 4KB
Since only one small page table in each level will reside in memory, the space occupied for
the page table is 2 4KB = 8KB
12. Assume that an un-pipelined machine has 9ns clock cycles. The machine uses four cycles for
ALU operations, four cycles for branches, and five cycles for memory operations. The relative
frequencies of these operations are 30%, 20%, and 50%, respectively. Suppose that pipelining
the machines adds Ins of overhead to the clock cycle time. Assume that the ideal CPI is one.
Ignore any other impact. What is the average clock cycle per instruction (CPI) of the
un-pipelined machine?
(a) 5.5
(b) 4
(c) 3.5
(d) 4.5
(e) 5
Answer: (d)
13. (Continued from the previous question.) What is the speedup in the instruction execution rate
gained from pipelining the machine?
(a) 4.05
(b) 4.75
(c) 5.15
(d) 5.85
(e) 6.28
Answer: (a)
102中山資工
NOTE: If some questions are unclear or not well defined to you, you can make your own
assumptions and state them clearly in the answer sheet.
1. You are the lead designer of a new processor. The processor design and compiler are
complete, and now you must decide whether to produce the current design as it stands or
spend additional time to improve it. Please consider and answer the following problems.
(1) You discuss this problem with your hardware engineering team and arrive at the
following options:
(a) Leave the design as it stands. Call this base machine Mbase. It has a clock rate of 500
MHz.
(b) Optimize the hardware. The hardware team claims that it can improve the processor
design to give it a clock rate of 600 MHz. Call this machine Mopt.
The following measurements have been made using a simulator for Mbase and Mopt.
CPI
Instruction class Frequency
Mbase Mopt
A 40% 2 2
B 25% 3 2
C 25% 3 3
D 10% 5 4
What are the CPI and MIPS (million instructions per second) rate for Mbase and Mopt
Copy Table 1-1 to your answer sheet and fill in the blanks with your answers.
(2) The compiler team proposes to improve the compiler for the machine to further enhance
performance. Call this combination of the improved compiler and the base machine
Mcomp . The instruction improvements from this enhanced compiler have been estimated
as follows.
For example, if the base machine executed 500 class A instructions, Mcomp would execute
0.9 500 = 450 class A instructions for the same program. What is the CPI for Mcomp ?
How much faster is Mcomp than Mbase? Copy Table 1-2 to your answer sheet and fill in the
blanks with your answers.
Table 1-1:
Machine CPI MIPS rate
Mbase
Mopt
Table 1-2:
Machine CPI Execution time Mbase 1 Execution time Mcomp
Mcomp
Answer
(1)
Table 1-1:
Machine CPI MIPS rate
Mbase 2.8 178.57
Mopt 2.35 255.32
CPI for Mbase = 0.4 2 + 0.25 3 + 0.25 3 + 0.1 5 = 2.8
CPI for Mopt = 0.4 2 + 0.25 2 + 0.25 3 + 0.1 2 = 2.35
MIPS for M base = 500 / 2.8 = 178.57
MIPS for Mopt = 600 / 2.35 = 255.32
(2)
Table 1-2:
Machine CPI Execution time Mbase /Execution time Mcomp
Mcomp 2.81 1.12
CPI for Mcomp = (0.4 0.9 2 + 0.25 0.9 3 + 0.25 0.85 3 + 0.1 0.95 5) / (0.4 0.9 +
0.25 0.9 + 0.25 0.85 + 0.1 0.95) = 2.5075 / 0.8925 = 2.81
Execution Time for Mbase (IC 2.8) / 500M
= = 1.12
Execution Time for Mcomp (IC 0.8925 2.81) / 500M
2. Here is a loop in C:
Loop: g = g + A[i];
i = i + j;
if (i != h) goto Loop;
Assume that A is an array of 100 elements and the variables g, h, i, and j are in registers $s1,
$s2, $s3, and $s4, respectively, In addition, assume that the base of the array A is in $s5.
Other registers that can be used are $t0 and $t1. A possible MIPS assembly code
corresponding to this C loop is showed as follows:
Loop: add $t1, $s3, OPA
add $t1, $t1, $t1
add $t1, $t1, OPB
lw $t0, 0(OPC)
add OPD, $s1, $t0
add $s3, $s3, $s4
bne $s3, OPE, Loop
Please determine proper values for the operands (OPA, OPB, OPC, OPD, OPE). Copy the
Table 2 to your answer sheet and fill in the operand values.
Table 2:
Operand OPA OPB OPC OPD OPE
Value
Answer
Table 2:
Operand OPA OPB OPC OPD OPE
Value $s3 $s5 $t1 $s1 $s2
3. Figure 1 shows a 16-bit adder consisting of four 4-bit ALUs using carry lookahead. Assume
that gi = ai‧bi (generate) and p; = ai + bi (propagate). Determine the value of P0, P1, P2, P3,
G0, G1, G2, G3, and C4 (CarryOut) when adding two 16-bit numbers 0001101000110011
and 1110010111101011 with Carry In bit equal to 0. Please copy Table 3 to your answer
sheet and fill in the blanks with your answers.
Table 3:
Signal P0 P1 P2 P3 G0 G1 G2 G3 C4
Value
CarryIn
a0 CarryIn
b0 Result0--3
a1
b1 ALU0
a2 pi
b2 P0
G0 gi
a3
b3 Carry-lookahead unit
C1
ci + 1
a4 CarryIn
b4 Result4--7
a5
b5 ALU1
a6 P1 pi + 1
b6 G1 gi + 1
gi
a7
b7
C2
ci + 2
a8 CarryIn
b8 Result8--11
a9
b9 ALU2
a10 P2 pi + 2
b10 G2 gi + 2
a11
b11
C3
ci + 3
a12 CarryIn
b12 Result12--15
a13
b13 ALU3
a14 P3 pi + 3
b14 G3 gi + 3
a1
a15 C4
b15 ci + 4
CarryOut
Figure 1
Answer
Table 3:
Signal P0 P1 P2 P3 G0 G1 G2 G3 C4
Value 0 1 1 1 0 1 0 0 1
4. Compare the single-cycle implementation, in which all instructions take 1 clock cycle, with
the five-stage (IF, ID, EX, MEM, WB) pipelined implementation using the following eight
instructions: load word (lw), store word (sw), subtract (sub), and (and), or (or), set- less-than
(slt) and branch-on-equal (beq). The operation times for the major functional units are 2 ns for
memory access, 2 ns for ALU operation, and 1 ns for register file read or write.
(1) Please complete the Table 4-1 by filling in the time for each component to calculate the
total time of each instruction executed in the single-cycle implementation. Assume that
the multiplexors, control unit, PC accesses, and sign extension unit have no delay. Please
copy Table 4-1 to your answer sheet and fill in the blanks with your answers.
(2) What is the clock cycle time for the single-cycle implementation and the pipelined
implementation? Please copy Table 4-2 to your answer sheet and fill in the blanks with
your answers.
(3) Consider a program consisting of 100 lw instructions and in which each instruction is
dependent upon the instruction before it. What would the actual CPI be if the program
were run on single-cycle implementation and the pipelined implementation with a
forwarding unit and a hazard detection unit? Please copy Table 4-2 to your answer sheet
and fill in the blanks with your answers.
Table 4-1:
Instruction Register ALU Data Register Total
Instruction class
fetch read operation access write time
Load word (lw)
Store word (sw)
R-format
2ns 1 ns 2ns 1 ns 6 ns
(add, sub, and, or, slt)
Branch (beq)
Table 4-1:
Design Clock cycle time Actual CPI
Single-cycle implementation
Pipelined implementation
Answer
(1)
Table 4-1:
Instruction Register ALU Data Register Total
Instruction class
fetch read operation access write time
Load word (lw) 2ns 1 ns 2ns 2ns 1 ns 8 ns
Store word (sw) 2ns 1 ns 2ns 2ns 7 ns
R-format
2ns 1 ns 2ns 1 ns 6 ns
(add, sub, and, or, slt)
Branch (beq) 2ns 1 ns 2ns 5 ns
Table 4-2:
(2) (3)
Design Clock cycle time Actual CPI
Single-cycle implementation 8 ns 1
Pipelined implementation 2 ns 2
Remark: Actual CPI for pipeline = [(5 – 1) + 100 + 99] / 100 ≈ 2
5. Increasing associativity requires more comparators, as well as more tag bits per cache
block. Assuming a cache of 4K blocks, a four-word block size, and a 32-bit address, find
the total number of sets and the total number of tag bits for caches that are direct mapped,
two-way and four-way set associative, and fully associative. Please copy Table 5 to your
answer sheet and fill in
Table 5:
Cache The total number of sets The total number of tag bits
Direct mapped
Two-way set associative
Four-way set associative
Fully associative
Answer
Table 5:
Cache The total number of sets The total number of tag bits
Direct mapped 4K ( 32 – 12 – 4) 4K = 64K bits
Two-way set associative 4K / 2 = 2K ( 32 – 11 – 4) 4K = 68K bits
Four-way set associative 4K / 4 = 2K ( 32 – 10 – 4) 4K = 72K bits
Fully associative 1 ( 32 – 4) 4K = 112K bits
6. Here is a series of address references given as word addresses: 1, 4, 8, 5, 20, 17, 19, 56,
11, 4, 43, 5, 6, 9, 17. Assuming a direct- mapped cache with four-word blocks and a total
size of 16 words that is initially empty, label each reference in the list as a hit or a miss
and show the final contents of the cache. Please copy Table 6-1 and Table 6-2 to your
answer sheet and fill in your answers.
Table 6-1:
Ref. 1 4 8 5 20 17 19 56 9 11 4 43 5 6 9 17
Hit/Miss
Table 6-2:
Block# 0 1 2 3
Starting Address
Answer
Table 6-1:
Ref. 1 4 8 5 20 17 19 56 9 11 4 43 5 6 9 17
Hit/Miss M M M H M M H M M H M M H H M H
Table 6-2:
Block# 0 1 2 3
Starting Address 16 4 8
Remark:
Word
address 1 4 8 5 20 17 19 56 9 11 4 43 5 6 9 17
Block
0 1 2 1 5 4 4 14 2 2 1 10 1 1 2 4
address
Tag 0 0 0 0 1 1 1 3 0 0 0 2 0 0 0 1
Index 0 1 2 1 1 0 0 2 2 2 1 2 1 1 2 0
Hit/Miss M M M H M M H M M H M M H H M H
Block# 0 1 2 3
Final content 16, 17 , 18, 19 4, 5 , 6 , 7 8, 9 ,10 ,11
7. Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the
primary cache, and a clock rate of 1 GHz. Assume a main memory access time of 100 ns,
including all the miss handling.
(1) Suppose the miss rate per instruction at the primary cache is 5%. What is the effective
CPI for the machine with one level of caching?
(2) If we add a secondary cache that has a 10- ns access time for either a hit or a miss and is
large enough to reduce the miss rate to main memory to 2%. What is the effective CPI for
the machine with the two- level cache?
Answer
(1) Effective CPI for one level cache = 1 + 0.05 100 = 6
(2) Effective CPI for two level cache = 1 + 0.05 10 + 0.02 100 = 3.5
8. Suppose you want to achieve a speedup of 50 with 100 processors. What fraction of the
original computation can be sequential?
Answer
1
Speedup = 50 = 𝑓 f = 0.9898 98.98%
+(1−𝑓)
100
102中興電機
1. To compare the performance of two different computers: M1 and M2. Table 1 shows the
measurements on these computers.
Table 1
Program Time on M1 Time on M2
Program 1 4 second 3 second
Program 2 10 second 20 second
Table 2 shows the additional measurements on these computers.
Table 2
Program Instructions executed on M1 Instructions executed on M2
Program 1 5 × 109 6 × 109
(a) If the clock rates of computers M1 and M2 are 4GHz and 6GHz, respectively, find the clock
cycles per instruction (CPI) for program 1 on both computers ?
(b) Assuming the CPI for program 2 on each computer is the same as the CPI for program 1,
find the instruction counts for program 2 running on each computer using the execution time
in Table 1 ?
Answer:
(a) CPIM1 = (4 4G) / (5 × 109 ) = 3.2; CPIM2 = (3 6G) / (6 × 109 ) = 3
(b) ICM1 = (10 4G) / 3.2 = 12.5 × 109 ; ICM2 = (20 6G) / 3 = 40 × 109
2. For a given instruction set architecture, please list three sources to increase in the CPU
performance?
Answer:
3. Convert the following MIPS codes to C codes. The parameter variable n corresponds to the
argument register $a0. The compiled program starts with the label of the procedure and then
saves two registers on the stack, the return address and $a0. The MIPS assembly codes are listed
as follows:
test:
addi $sp, $sp, -8
sw $ra, 4($sp)
sw $a0, 0($sp)
slti $t0, $a0, 1
beq $t0, $zero, L1
addi $v0, $zero, 1
addi $sp, $sp, 8
jr $ra
L1: addi $a0, $a0, -l
jal test
lw $a0, 0($sp)
lw $ra, 4($sp)
addi $sp, $sp, 8
mul $v0, $a0, $v0
jr $ra
Answer:
int test (int n)
{
if (n < 1) return (1);
else return (n*fact(n – 1));
}
4. Please illustrate and describe the PC-relative addressing mode in MIPS machines.
Answer:
4. PC-relative addressing
op rs rt Address Memory
PC Word
Answer:
ID/EX
WB
EX/MEM
MEM/WB
Control M WB
EX WB
IF/ID M
Add
Add
Shift Branch
RegWrite
left 2
ALUSrc
MemWrite
MemtoReg
Instruction
PC Read
Address Read
register 1 data 1
Read Zero
register 2 ALU
Instruction
Write
Read
Result Address Read
memory register
data 2 data
Data
Write Registers memory
data
Write
data
Instruction
[15-0] Sign ALU
extend control
Instruction MemRead
[20-16] ALUOp
Instruction
[15-11]
RegDst
Answer:
(a) At the end of the fourth cycle of execution, registers $3 and $7 are being read.
(b) At the end of the sixth cycle of execution, register $4 is being written and registers $6 and $1
are being read.
(c) Execute this program needs (5 – 1) + 6 = 10 clock cycles
Remark(c): there are data hazards in the given code but no forwarding scheme is used in the
datapath. Hence, the code cannot execute correctly.
7. Considering the following datapath shown below, draw paths (use color pens if possible) to show
the flow of both data and the program counter for the following instructions.
(a) add $t0, $t0, $t1 # single cycle datapath
(b) lw $a0, 0($t1) # multicycle datapath
Add
4 Add Sum
Shift PCSrc
left 2
Instruction [31-26]
Control
Instruction [5-0]
(a)
Outputs
Control
Op[5-0]
Jump
address
Instruction [5-0] 26 Shift 28 [31-0]
left 2
Instruction
[31-26] 8000 0180
PC[31-28]
PC Instruction Read
Address [25-21] register 1
Read A
Memory Instruction
[20-16] Read data 1
MemData register 2 ALU
Instruction ALUOut
[15-0] Instruction Write Read EPC
Write [5-0] Registerdata 2 B
data Instruction
Register Write
data Registers
Instruction
[15-0]
Cause
Memory 16 32 ALU
data Sign Shift
register extend left 2 control
Instruction [5-0]
(b)
Answer:
(a)
Add
4 Add Sum
Shift PCSrc
left 2
Instruction [31-26]
Control
Outputs
Control
Op[5-0]
Jump
address
Instruction [25-0] 26 Shift 28 [31-0]
left 2
Instruction
[31-26] PC [31-28]
PC
Instruction Read
Address [25-21] register 1
Read A
Memory Instruction Read
[20-16] data 1
register 2 ALU
MemData
Instruction Instruction Write Read
Write [15-0] [15-11] Register data 2 B
data Instruction
Register Write
data Registers
Instruction
[15-0]
Memory ALU
data Sign Shift
extend left 2 control
register 32
16
Answer: (d)
2. A compiler designer is trying to decide between two code sequences for a particular computer.
The hardware designers have supplied the following facts:
CPI for each instruction class
A B C
CPI 1 2 3
For a particular high- level language statement, the compiler writer is considering two code
sequences that require the following instruction counts:
Answer:
(a) Total clock cycle for code sequence 1 = 1 2 + 2 1 + 3 2 = 10
Total clock cycle for code sequence 2 = 1 4 + 2 1 + 3 1 = 9
Code sequence 1 is faster
(b) CPI for code sequence 1 = 10 / 5 = 2
CPI for code sequence 2 = 9 / 6 = 1.5
3. Convert 16-bit binary version of 210 and -210 to 32-bit binary number.
1111 1111 1111 1111 1111 1111 1111 11002
Answer:
210 = 0000 0000 0000 0000 0000 0000 0000 00102
-210 = 1111 1111 1111 1111 1111 1111 1111 11102
Remark: 題目給的二進位數為4
4. In ARM, suppose register r0 has the binary number
1111 1111 1111 1111 1111 1111 1111 11112
and that register r1 has the binary number
0000 0000 0000 0000 0000 0000 0000 00012
and the following instruction is executed.
CMP r0, r1
Which conditional branch is taken? Please explain your answer.
BLO L1 : unsigned branch (branch on lower)
BLT L2 : signed branch (branch on less than)
Answer:
Instruction BLO is not taken to L1
Instruction BLT is taken to L2
5. In ARM, the branch instruction is encoded using 24-bit address. You might think that the 24-bit
address would extend the program limit to 224 or 16MB, which would be fine for many programs
but constrain some large ones. How to solve this problem? Hint: specify a register that would
always be added to the branch address.
Answer:
We can specify the program counter register (PC) that would always be added to the branch
address, so that a branch instruction would calculate the following:
Branch target address = PC + branch address
This sum allows the program to be as large as 2 32 , solving the branch address size problem.
6. There are situations in pipelining when the next instruction cannot execute in the following clock
cycle. These events are called hazards. In the following situation, please point out what kind of
hazards is occurred.
Suppose you found a sock (短襪) at the folding station for which no match existed. One possible
strategy is to run down to your room and search through your clothes bureau (衣櫃) to see if you
can find the match. Obviously, while you are doing the search, loads that have completed drying
and are ready to fold and those that have finished washing and are ready to dry must wait.
Answer:
(a)
Instruction sequence
lw $1, 40($6)
add $6, $2, $2
Delay I3 to avoid RAW hazard on
a. nop
$1 and $6 from I1 and I2
nop
sw $6, 50($1)
lw $5, -16($5)
nop
Delay I2 to avoid RAW hazard on
b. nop
$5 from I1.
sw $5, -16($5)
add $5, $5, $5
(b)
Instruction sequence
lw $1, 40($6)
No RAW hazard on $1 from I1
a. add $6, $2, $2
(forwarded)
sw $6, 50($1)
lw $5, -16($5)
nop Delay I2 to avoid RAW hazard on
b.
sw $5, -16($5) $5 from I1
add $5, $5, $5
102 台科大電子
1. What is the power wall? How does it affect CPU development?
Answer:
Power Wall is the increasing power consumption and resulting heat generation of the
processing unit. The power wall can be mitigated by "shrinking" the processor by using
smaller traces for the same logic. This poses manufacturing, system design, and deployment
problems.
The power wall has forced a dramatic change in the design of microprocessor. Rather than
continuing to decrease the response time of a single program running on the single processor,
as currently all desktop and server companies are shipping microprocessors with
multiprocessors per chip, where benefit is often more on throughput than on response time.
Answer:
(a) Structural hazards: hardware cannot support the instructions executing in the same clock cycle
(limited resources)
(b) Data hazards: attempt to use item before it is ready (Data dependency)
Example:
1 lw $5, 50($2)
2 add $2, $5, $4
3 add $4, $2, $5
4 beq $8, $9, L1
5 sub $16, $17, $18
6 sw $5, 100($2)
7 L1:
Type Example
假設上列指令是在只有單一記憶體的 datapath 中執行,則在第 4 個時脈週
Structure 期,指令 1 讀取記憶體資料同時指令 4 也在從同一個記憶體中擷取指令,
hazard 也就是兩個指令同時對一記憶體進行存取。在這樣情形下就會發生
structural hazard
指令 1 在第 5 個時脈週期時才會將計算結果寫回暫存器$5,但指令 2 和 3
Data hazard 分別在時脈週期 3 跟 4 時便需要使用暫存器$5 的內容。此時指令 2 和 3 擷
取到暫存器$5 的內容並非最新的,因此在這樣情形下就會發生 data hazard
(c) Speculation: an approach whereby the compiler or processor guesses the outcome of an
instruction to remove it as a dependence in executing other instructions.
For example, the compiler can use speculation to move the lw instruction across sw, if we
guess the memory addresses for the two instructions are different.
sw $1, 100($2)
lw $3, 100($4)
3. Suppose we have a 32-bit virtual address, 4KB pages, and 4 bytes per page table entry. What
is the total page table size?
Answer
Number of entries in page table = 4GB / 4KB = 1M
The page table size = 4B 1M = 4MB
4. Please use CPU execution time formula to explain why superscalar and super pipeline could
improve performance.
Answer: CPU execution time = Instruction count CPI clock cycle time
Superscalar enables the processor to execute more than one instruction per clock cycle. This will
increase IPC (or reduce the average CPI) and thus can improve performance.
Superpipeline increases the depth of the pipeline (more pipeline stage) to overlap more
instructions. Performance is potentially greater since the clock cycle time can be shorter.
5. Please explain why the clock cycle/item in the right-hand side is not consistent with
instruction/item in the left-hand side?
1200 2000
800
1200
600
800
400
0 0
4 8 16 32 64 128 256 512 1024 2048 4096 4 8 16 32 64 128 256 512 1024 2048 4096
Size (K items to sort) Size (K items to sort)
Answer: Quicksort consistently has many fewer misses per item to be sorted.
Remark:
5
Radix sort
4
1
Quicksort
0
4 8 16 32 64 128 256 512 1024 2048 4096
Size (K items to sort)
102 台科大資工
1. A compiler designer is trying to decide between two code sequences for a particular
computer. The hardware designers have supplied the following facts:
CPI for instruction class
A B C
CPI 3 4 2
For a particular high- level- language statement, the compiler writer is considering two code
sequences that require the following instruction counts:
Code sequence Instruction counts for instruction class
A B C
1 2 2 6
2 2 5 1
(a) Which code sequence executes the most instructions?
(b) Which will be faster?
(c) What is the CPI for each sequence?
Suppose we measure the code for the same program from two different compilers and obtain
the following data:
Instruction counts (in billions) for instruction class
Code form
A B C
Complier 1 1 2 7
Compiler 2 2 1 5
Assume that the computer's clock rate is 5 GHz.
(d) Which code sequence will execute faster according to execution time?
(e) Which code sequence will execute faster according to MIPS?
Answer:
5 10
(2 1 5) 109
MIPS for compiler 2 = = 2000
4 106
According to MIPS, code sequence from compiler 1 is as fast as from compiler 2.
Answer:
Cache1: number of blocks = 4
ID.Flush
A
ID/EX
EX/MEM
Control M MEM/WB
Cause
IF/ID M
EPC
Registers C
ALU
Instruction
80000180 PC memory Data
memory
IF.Flush
(a) Please make the right binding between each of units "A", "B", "C", "D", "E" and one of
following names.
(1) AND array (2) Hazard detection unit (3) Forwarding unit
(4) XOR array (5) Shift left 2 unit (6) Sign extend unit
(7) MUX
(b) For the following instruction sequence and assuming that branch (B) is taken. Please
describe the instruction pair that causes data hazard, the corresponding depend register,
and the unit required to solve the data hazard.
(S) sub $5, $3, $4 # Reg5 = Reg3 - Reg4
(A) add $1, $4, $5 # Reg1 = Reg4 + Reg5
(D) and $3, $2, $5 # Reg3 = (Reg2 OR Reg5)
(L) lw $7, 0($1) # Load from MEM (compute base on $1) to $7
(O) ori $9, $3, $7 # Reg9 = (Reg3 OR Reg7)
(B) beq $7, $9, loop # Branch to loop if Reg7 = Reg9
(R) sra $3, $9, 2 # Reg3 = Reg9 >> 2
(T) slt $9, $3, S7 # Reg9 = 1 if Reg3 < Reg7
(I) loop: addi $9, $7, 40 # Reg9 = Reg7 + 40
(c) Continue with (b). How many cycles does it take to execute the above instruction sequence?
Answer:
(a)
A B C D E
(2) (3) (4) (5) (6)
Hazard Forwarding XOR array Shift left 2 Sign extend
detection unit unit unit unit
(b)
Instruction pair Function
Depend
that cause data units
register
hazard required
(S & A) $5 B
(S & D) $5 B
(A & L) $1 B
(D & O) $3 B
(L & O) $7 A, B
(L & B) $7 A
(O & B) $9 A
(O & R) $9 B
(R & T) $3 B
Remark: Since the datapath cannot forward any data to ID stage, forwarding unit (B) is
not needed for (O & B)
(c) Instructions between (L & O) and (O & B) required one clock delay for data hazard,
respectively.
Branch is taken imply the instruction (R) is fetched into the pipeline and will be flushed and
also cause one clock delay. Hence, the total clock cycles required to execute the code is (5 – 1)
+ 7 + 3 = 14 clocks
102 台師大資工
1. Consider a program written in high level programming language is executed separately on
two processors with different instruction set architectures. Processor A executes 10 billion
instructions with clock rate 4GHz and CPI 1.0. Processor B executes 8 billion instructions
with the same clock rate but CPI 1.1.
(a) Find the MIPS rates of those processors.
(b) Find the total execution time for each of the processors.
(c) Compare their performance.
Answer
4G 4G
(a) MIPSA = = 4000, MIPSB = = 3640
1.0× 106 1.1× 106
10× 109 × 1
(b) Execution Time for processor A = = 2.5
4× 109
8× 109 × 1.1
Execution for processor B = = 2.2
4 × 109
2. Assume a decimal value -1234 is stored in a 32-bit general purpose register. What is the value
(in binary or hexadecimal format) stored in the register if it is represented as
(a) a signed integer, and
(b) a single-precision floating-point number in the IEEE 754 standard.
Answer: 123410 = 0000 0000 0000 0000 0000 0100 1101 00102
(a) -123410 = 1111 1111 1111 1111 1111 1011 0010 11102 = FFFFFB2C16
(b) -0000 0000 0000 0000 0000 0100 1101 00102 = -1.0011010010 210
S E F
binary 1 01110101 00110100100000000000000
hexadecimal BA9A4000
3. Consider a computer datapath divides the execution of an instruction into five stages, i.e.
instruction fetch, instruction decode, execution, memory access, and write back. The
processing times of these stages are 400ps (10 -12 seconds), 200ps, 120ps, 350ps, and 200ps,
respectively,
(a) What is the highest operating frequency if it is a single cycle (not pipelined) datapath?
(b) A pipeline register adds 20ps delay. What is the highest operating frequency if it is a
five-stage pipeline datapath?
(c) Assume an infinite code sequence is executed, and no hazard is found between
instructions. Compare the performances of the single cycle datapath and the pipelined
datapath.
Answer
(a) 1 / (400 ps + 200 ps + 120 ps + 350 ps + 200 ps) = 787.4MHz
(b) 1 / 420 ps = 2.38 GHz
(c) Pipeline datapath is (400 + 200 + 120 + 350 + 200) / 420 = 3.02 times faster than single cycle
datapath
4. Consider the code sequence executing in a five-stage (IF, DE, EX, MEM, and WB) pipelined
datapath.
lw $1, 0($6) # $1 MEM[$6]
lw $2, 1($6) # $2 MEM [$6 + 1]
add $3, $1, $2 # $3 $1 + $2
sub $4, $3, $5 # $4 $3 + $5
beq $3, $0, L1 # jump to L1 if $3 == $0
….
L1:
(a) Show all the possible hazards between instructions.
(b) Describe how the hazards can be resolved in recent computer architectures.
Answer
(a)
Data hazard
Register Instruction pair
$1 (1, 3)
$2 (2, 3)
$3 (3, 4), (3, 5)
(b) Data hazard can be resolved by forwarding technique in recent computer architectures.
However, forwarding technique cannot save the day is when an instruction tries to read a
register following a load instruction that writes the same register. Hence, the pipeline need a
stall between the load and its use.
5. Consider a 128 bytes fully associative mapped cache with 4-word blocks (a word has 32 bits).
Its replacement policy is LRU. Suppose the following memory word addresses (in decimal)
are accessed in sequence:
40, 41, 42, 400, 43, 400, 60, 61, 62, 63, 64, 800, 40, 41, 42, 800, 43, 44, 60, 61, 62, 120, 121,
122, 123, 168, 169, 41, 42, and 400.
(a) Show whether it is a hit or a miss in the cache for each of the memory accesses.
(b) What is the miss rate?
128
Answer: There = 8 blocks in the cache.
4× 4
(a)
Word address Block address Hit/Miss
40 10 Miss
41 10 Hit
42 10 Hit
400 100 Miss
43 10 Hit
400 100 Hit
60 15 Miss
61 15 Hit
62 15 Hit
63 15 Hit
64 16 Miss
800 200 Miss
40 10 Hit
41 10 Hit
42 10 Hit
800 200 Hit
43 10 Hit
44 11 Miss
60 15 Hit
61 15 Hit
62 15 Hit
120 30 Miss
121 30 Hit
122 30 Hit
123 30 Hit
168 42 Miss
169 42 Hit
41 10 Miss
42 10 Hit
400 100 Miss
(b) Miss rate 10/30 = 1/3
102海洋電機
1. Draw a schematic diagram for a 4-bit adder, using 1-bit full-adder as a basic unit
Answer
a3 b3 a2 b2 a1 b1 a0 b0
c4 + + + + c0
s3 s2 s1 s0
2. Explain sequential logic and combinational logic. Give one example for each class.
Answer
Sequential logic: sequential logic is a type of logic circuit whose output depends not only on the
present value of its input signals but on the past history of its inputs. That is, sequential logic
has state (memory).
A common example of a circuit employing sequential logic is the flip- flop.
Combinational logic: The elements that operate on data values are all combinational, which
means that their outputs dependent only on the current inputs. Given the same input, a
combinational element always produces the same output.
The 4-bit ripple carry adder is an example of a combinational; element.
Answer
Answer
(1) First Fit - A resource allocation scheme (usually for memory). First Fit fits data into memory
by scanning from the beginning of available memory to the end, until the first free space which
is at least big enough to accept the data is found. This space is then allocated to the data.
(2) A resource allocation scheme (usually for memory). Best Fit tries to determine the best place to
put the new data. One example might be to try and minimize the wasted space at the end of the
block being allocated - i.e. use the smallest space which is big enough.
Remark: This topic belongs to OS
5. Explain the four conditions that must hold simultaneously in a system for causing a deadlock
situation.
Answer
(1) Mutual Exclusion Condition: At least one resource must be held in a non-shareable mode, that
is, only one process at a time claims exclusive control of the resource. If another process
requests that resource, the requesting process must be delayed until the resource has been
released.
(2) Hold and Wait Condition: There must exist a process that is holding a resource already
allocated to it while waiting for additional resource that are currently being held by other
processes.
(3) No-Preemptive Condition: Resources cannot be removed from the processes are used to
completion or released voluntarily by the process holding it.
(4) Circular Wait Condition: The processes in the system form a circular list or chain where each
process in the list is waiting for a resource held by the next process in the list.
Remark: This topic belongs to OS
Answer
Answer
(a) The south bridge handles all of a computer's I/O functions, such as USB, the system BIOS,
and the interrupt controller.
The north bridge typically handles communications among the CPU, in some cases RAM,
and PCI Express (or AGP) video cards, and the south bridge.
(b)
NOR flash Nand flash
名稱由來 Bit cell like a NOR gate Bit cell like a NAND gate
存取方式 Random read/write access Block-at-a-time access
價格 More expensive Cheaper
用途 Instruction Mem. in embedded systems USB keys, media storage, ...
(c) A cache write miss refers to a failed attempt to write a piece of data in the cache, which
results in a main memory access with much longer latency. There are two ways to handle
write miss.
Write allocate: the block is fetched from memory and then the appropriate portion of the
block is overwritten in the cache.
No write allocate: Update the portion of the block in memory but not put it in the cache.
(d) 500 0.98 (10 + 2) + 500 0.02 (4 + 55 + 2 + 10) = 6590ns
(e) Speedup = (7 + 8 + 4 + 4 + 6 + 5) / 8 = 4.25
(f) Function of F1: if (A = B) then F1 = 1 else F1 = 0;
Function of F2: if (A > B) then F2 = 1 else F2 = 0
8
2
1 3
7
6
4 5
Answer
No. 6 No. 7 No. 8 No. 9
(a)
2 0.00010001 4 000000001
No. 1 No. 2 No. 3 No. 4 No. 5
(b)
1 0 1 0 0
3. Datapath問題
Add
4 Add Sum
Shift PCSrc
left 2
Instruction [31-26]
Control
1
2
Instruction [25-21] Read
PC Read
address
register 1
Read
6
Instruction [20-16] data 1
Read
Instruction register 2
[31-0] 3 Write Read
ALU
Address Read
data
Instruction
memory
4 Register data 2
Data
Write memory
Instruction [15-11] data Registers 7
Write
data
Instruction [15-0] 16 32
Sign
ALU
extend
5 control
Instruction [5-0]
8
Instr. Data
Reg. Reg. Reg. Reg. Instr. Mem. Data Mem.
Mem. Mem.
number content number content content content
Addr. Addr.
00000 800h 10000 0 100h lw $t 9, 10($t1) 200h 0
00001 200h 10001 0 104h lw $t 10, 14($t1) 204h 0
00010 11111111h 10010 0 108h add $t3, $t9, $t10 208h 0
00011 50h 10011 0 10ch sw $t3, 4($t1) 20ch 0
00100 0 10100 0 110h beq $t3, $t4 100 210h 55555555h
00101 0 10101 0 114h 214h aaaaaaaah
00110 0 10110 0 118h 218h 0
00111 0 10111 0
01000 100h 11000 0
01001 0 11001 0
01010 0 11010 0
01011 0 11011 0
01100 0 11100 0
01101 0 11101 0
01110 0 11110 0
01111 0 11111 0
注意:
(1) PC內容為100h
(2) 執行指令lw $t9, 10($t1),其中$t9以01001表示,$t1以00001表示,10為16進制
(3) sw指令說明同(2)
問題:
(a) 執行第一個指令時(lw),上圖中箭號所指編號1, 2, 3, 4, 5, 6, 7, 8處之值為何?(5, 6, 7,
8請以16進制表示)
(b) 執行第三個指令時(add),上圖中箭號所指編號1, 2, 3, 4, 5, 6, 7, 8處之值為何?
(c) 所有指令執行完畢後,PC內容為何?請以16進制表示。
Answer