Memory Hierarchy Design Overview
Oct 17:
Don't use those as the stall counts; use those as the latency when you're drawing the pipeline.
So the load's memory access finishes later, so you'll need the extra latency from store to, sorry, load to add. No, no, load to add.
Because the load finishes at the end of the second memory stage. There's execute, then memory. Draw the pipeline of the next instruction, which would be an add. Then when does the add need the value of the load? That's just the latency between them.
Floating load and floating add. Then you have to do floating load and floating store, and then, wait, it's another… Oh, no, no, no.
Okay, I think that might have been, like… That makes sense.
I already submitted it, I'm not redoing it. I'm just taking it out. I think I got partway there, like… how many times have you unrolled this? Twice. Just twice, and you were able to get rid of all the stalls? Yeah.
Funny.
And I thought you were the… I had to unroll 3 times to get it to work.
Oh, I eventually had it 3 times.
And are we… we're not reordering the instructions. Oh, that's the whole point. That's the whole thing.
It's still some benefit.
Yeah, because I didn't, I didn't… Good afternoon!
Well, we are actually in the textbook. We are in Chapter 2, right?
I'm covering Appendix B until…
So, last time we talked about block placement and identification. For a given memory address, you should be able to partition it into the fields for block offset, cache index, and tag. So, we've done that, with the only example being a direct-mapped cache; set-associative and fully associative are coming.
More later, and then let's finish up Appendix B. The last things we need to see are block placement and identification, replacement, and write policy. So do we need any policy if we have only one choice?
No, right? There aren't many options there, so we will look at that.
You switch things around after unrolling. Well, you switch things around…
Audio shared by Kim, Eun J
Before starting today's lecture, do you have any questions?
So, for that one homework question: instead of giving a table, I gave a latency, right? And it's pipelined. Then how do you interpret that as a number of stalls between instructions? You want to draw it, for example.
Okay, I don't have a clear memory of it, but my guess is this way.
Let's say… Come on. Okay.
So, let's say memory is 2, and memory address calculation is 1. So when you have a load, you will have fetch, then decode, because it says pipelined, right? Decode, and then execution. In execution you have address calculation, and then memory, let's say, is 2 cycles of memory access time. Then you will update the register here. So let's say you have a load into F0, whatever, and then another ALU op, let's say FADD, uses F0: you have a dependency. So what about this add?
It is pipelined, it says it's pipelined, so it will be fetched here, decoded, and then it tries to execute at this moment, but your memory data is not there yet, right? So it will stall, and you will start executing here.
Okay.
All right, so any latency, you would interpret this way. Let's say this add takes 4 cycles, or 3 cycles, you have a 4 or a 3 depending on the number, and then let's say there is a multiplication that uses this result, F2. Then you will have fetch, decode, and then execution will be delayed, delayed, delayed, something like that. So you should be able to draw it this way.
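The stall counting just described can be sketched in a few lines. This is my illustration, not part of the course materials; the cycle numbers assume a five-stage layout with a 2-cycle memory stage and no same-cycle forwarding.

```python
# Sketch: how many cycles a dependent instruction must stall, given
# when the producer's result is ready and when the consumer would
# naturally execute. Pipeline assumed: F D X M M W.

def stalls_between(producer_ready_cycle, consumer_exec_cycle):
    """Stalls = how far past its natural execute slot the consumer
    must wait for the producer's result."""
    return max(0, producer_ready_cycle - consumer_exec_cycle)

# Load issued at cycle 0: F=0 D=1 X(addr)=2 M=3 M=4,
# so the data is usable starting at cycle 5 (assumption: no
# forwarding within the same cycle).
load_ready = 5
# Dependent add issued at cycle 1: F=1 D=2 X=3 if nothing stalls.
add_exec = 3
print(stalls_between(load_ready, add_exec))  # 2 stall cycles
```

With a 1-cycle memory, the same arithmetic gives 1 stall, which matches the usual load-use hazard picture.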
Good news: I don't have this on your midterm. I already finalized the midterm questions.
If we have two loads, is there a structural hazard on the second load, because they're both occupying the memory stage?
That is a very good question. It's related to this chapter, yes. So we will talk about cache design, whether we have a non-blocking cache or a blocking cache. If it's a blocking cache, yes, the second load will have to wait outside of memory. So there will be extra delay at memory.
Oh, but for this problem, don't worry about structural hazards. Okay, there's no structural hazard. Yeah, we don't worry about it, okay.
Okay. That's a good one. Okay.
Okay. Because you haven't learned memory yet, okay? So your memory is ideal. Ideal means even if something is already going on, it can accept another request, okay?
Any questions?
Alright.
Review all the problems, including the homework, okay?
I told you.
The homework, too. Not only the quizzes.
Let's go.
Let me give you the review of cache replacement policies and write policies.
Before that, it's already been answered. So, what kinds of replacement policies do you know?
You see here?
So, when we need a policy, there are multiple candidates, right?
You learned last time about the direct-mapped cache. What is a direct-mapped cache? You have an address, you divide it up, and you have an index, right? And then you go to one row. In that one row there is one spot, okay? There is something there, you do the tag matching, and it doesn't match. So what does that mean? It's a conflict miss, right? There is something there, but we are sharing: another block is there.
Replacement is happening, isn't it?
Do you need a policy there?
In a direct-mapped cache, you have one candidate; that one should be kicked out. Okay.
So where we need a replacement policy is where you have options.
You have options when it is at least two-way set associative: two-way, four-way, eight-way, fully associative, okay? There is flexibility, okay?
So when you have flexibility, let's say your cache is… I will make up a question quickly, okay? Total cache size: 64 bytes. Your cache block size is 8 bytes, and it is two-way. Okay.
So tell me how you partition your address.
How many bits do you need for the block offset?
It's 8 bytes; that means it's byte addressable, okay? Your memory is not bit addressable. Bits would mean multiplying by 8 again, 64 bits, right? You don't address each bit; it would be each byte. So, 8 bytes means 3, right? 3 bits as offset, and then the index.
How do you do the index?
Just like we did before.
So, how do you… okay, you would divide the 64-byte total size by the block size, which gives you how many blocks you can have in this cache, right? So how many blocks? You have 8 blocks.
But it says two-way, okay?
Two-way means with one index, you have two options, okay? So here is how I visualize it. Okay, you come up with 8 blocks, right? So if it is one table, direct-mapped, you have 8 rows.
Is it correct? So, an 8-row table. Now, it's two-way, so you cut it in half and put the halves in parallel, okay? Then how many entries are in each table? Four, okay? So you need only 2 bits, right? Can you see that? Instead of 3 bits.
So your index field shrinks as the number of ways increases.
Okay?
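The partition just worked out can be checked with a small sketch. This is my code, not from the lecture; the function name and the 32-bit address width are assumptions.

```python
from math import log2

def partition_bits(cache_size, block_size, ways, addr_bits=32):
    """Return (tag, index, offset) bit widths for a set-associative
    cache. Assumes byte-addressable memory and power-of-two sizes."""
    offset = int(log2(block_size))                  # bits to pick a byte in a block
    num_sets = cache_size // (block_size * ways)    # blocks / ways
    index = int(log2(num_sets))                     # bits to pick a set
    tag = addr_bits - index - offset                # everything left over
    return tag, index, offset

# 64-byte cache, 8-byte blocks, two-way: 8 blocks -> 4 sets
print(partition_bits(64, 8, 2))  # (27, 2, 3) with 32-bit addresses
```

Setting `ways=1` gives the direct-mapped split (3 index bits here), showing exactly how the index field shrinks as associativity grows.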
So here, this is where the replacement policy comes in. Okay, I have address 01, and I go to set 01, and there are tags saved there, and I do a tag match with both of them. Neither of them matches.
Which means it's a miss, right? A conflict miss. Then you need to go to memory, you bring a block, and then where do you want to put it? That is the policy, right? You want to have a policy: among these two, which one do you want to kick out?
If you knew the future… A lot of you, in your term project, if you selected cache replacement policy, and in Homework 4 also, you study cache replacement policies.
Oftentimes we compare against the genie. Genie means something that can see the future. Okay, so you know the future reference string, and you know for sure a block won't be accessed for a while, so you kick it out. But we don't have a genie in our life, right? So what do we do? We go based on history, right? So what do we do? LRU, least recently used. Why does it work?
In an interview, for replacement policy, this is the default policy you would mention first.
Don't jump straight to RRIP from your Homework 4, okay? Not many people have read the ISCA paper, but you can educate them. If you have an interview, you can start with LRU. Why does LRU work? Tell me.
Temporal locality. Because of temporal locality, okay? So, cache: even in 10, 20, 30 years, when you've forgotten everything else, you should remember that caches work because of locality.
Right?
What you just used has a higher chance to be used again. So if you kick out the least recently used one, that is the best policy. Okay. Just to give a fun experiment: instead of LRU, you randomly kick out, okay? Random.
Because, LRU, think about it. When we share the idea, LRU sounds so simple: least recently used.
Then think about it: what kinds of things do you have to implement in hardware? What is it?
You need to save…?
That's… I want to hear different ways of implementing it. We have, okay, fewer than 40 students. There are at least two different options, which I show here.
So, you said there's something…
Just record the access time. Okay, record the access time. The time, right? Time. Have you ever printed out a time in your system? Okay, for a time counter, how many bits will be enough?
Right? Yeah. So then, okay, let's say we save the access time. Finding the least recently used one means either you maintain ascending order, right, or the other way, and then you pick the smallest number, right?
Or, every time you have a new access, it gets added at the tail. It's a queue, right?
And the one at the front will be kicked out. Can you see that? It's an additional structure. In addition to the cache structure hardware, you need an additional queue structure to keep track of arrival order. But if you keep it sorted that way, do you really need to save the access time?
No. Can you see that?
So there are two ways. You can simply associate with each block another field for access time.
But the word "time" by itself means something unbounded, okay? In hardware, if we use 16 bits, there is a wraparound, right? After 2^16 ticks it gets back to 0, right?
So how many bits will be enough? That's a big study, a theoretical study. You should have a finite-size counter, and then you can compare, okay?
Or, if it's two-way, it's easier: you can have a single flag, right? Whichever way is referenced, you set it to 1 and the other to 0, right? Then you know which one was used more recently, right? But what if it's 3-way?
Or 4-way. Okay, say there is no 3-way; 4-way. Then for each one you order them, and you can order them without a queue. How about that? With an index, right?
So let's say you have 4-way.
Okay, this is similar; understanding this will help you understand RRIP. So let's say, at the beginning, the set is empty. Okay, empty. One block is brought in, so you reference that first one, and you give it a counter value, okay?
And then you have a new one. Let's say a new one comes; which one do you kick out? This one, right? The biggest number is the LRU one.
Isn't it? Then what happens?
The newly arrived one should have counter 1, right? Counter 1. Then the others, you need to do what? Add 1: 2, 3, 4. Can you see that?
These are all O(N) operations.
Okay, so having exact LRU implemented in hardware is not simple. That's why papers on pseudo-LRU, RRIP, and everything else came along, okay?
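The counter scheme above can be written out as a toy model. This is my interpretation of the lecture's description, not hardware or course code; the class name and the exact update rule (touched way gets counter 1, everyone else ages) are assumptions.

```python
class LRUSet:
    """One set of a `ways`-way cache with counter-based true LRU.
    On each access, the touched way's counter becomes 1 and every
    other valid way's counter increments: the O(N) update from the
    lecture. The victim is the way with the largest counter."""

    def __init__(self, ways=4):
        self.tags = [None] * ways
        self.age = [0] * ways   # 0 = empty/invalid

    def access(self, tag):
        if tag in self.tags:                          # hit
            way = self.tags.index(tag)
            hit = True
        else:                                         # miss: pick a victim
            if None in self.tags:
                way = self.tags.index(None)           # fill an empty way
            else:
                way = self.age.index(max(self.age))   # evict the max counter
            self.tags[way] = tag
            hit = False
        # O(N) counter update: every valid way ages, touched way becomes MRU
        for w in range(len(self.age)):
            if self.tags[w] is not None:
                self.age[w] += 1
        self.age[way] = 1
        return hit

s = LRUSet(4)
for t in ["A", "B", "C", "D"]:
    s.access(t)
s.access("E")          # evicts A, the least recently used
print("A" in s.tags)   # False
```

The per-access loop over every way is exactly why exact LRU gets expensive in hardware as associativity grows, and why pseudo-LRU approximations exist.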
All right.
So, because this is complicated, people came up with: oh, then what about if I just randomly choose and replace?
And actually, it wasn't that bad, okay? Look at this comparison. So in your Homework 4, you are asked to compare performance, right? You change from two-way to four-way to eight-way, and vary the size, so you have a sensitivity study. But if you look at larger caches and higher associativity, you don't see much difference between them. Okay, then…
What's the benefit of random?
What's the problem with LRU, if the performance is similar?
You need more hardware for LRU. Yes, very good. And what else?
Okay, I will tell you. The CPU is here. Memory is there. On every miss, you have some traffic going on.
People can spy: tap the wire, listen, and then steal the information.
LRU, because we know LRU is used in most places, means you can look at the pattern and guess what kind of accesses are going on. So you reveal some information, because it is a very well-known policy. With random, you don't know, right?
Okay?
Anyway, you can consider that. These are interesting things. Security has become more and more important. It's timely.
What happens when you have a miss? It means you don't have the block you're looking for in the cache.
Then you're going to bring that block from memory.
So then replacement happens.
Let me skip all this discussion, because I think I already talked about it.
Okay.
Now, let's think about how to implement the LRU policy in hardware.
Let's say you have 4-way.
Then every time a reference happens to a way, you need to record it, and you need a total ordering of the reference times.
So on every reference, you need to sort, right? So the most recent one becomes…
We already talked about this, right? So, let's talk about what happens on a write. Remember the example we went over? Do you remember? That example only had reads. Can you see that? I gave all reads in that example.
And what happens on a write?
So, do you agree: when we cache, we don't move data to the cache. We use the terminology "move", you know, move, but actually it's not a move. What is it?
It's a copy, isn't it?
So you have the original copy in memory, and you have a duplicate copy in the cache for fast access.
So what do you do with a write?
Change the original.
You have to update the original. You can do two things. First:
Don't bother updating the memory, just update the cache. So you just update the cache. If it's a hit, if you find the data in the cache, you only update it in the cache. Then what happens?
At some point, when the block is kicked out, you will write it back, right? So we call it write-back.
So, do you agree you need an extra flag to denote whether the block has been updated since it was brought into the cache? And the other policy?
Write-through. Write-through means when you write to the cache, you also write through to memory. So they are always consistent, coherent between cache and memory, right? So anytime you have a system crash, it's easier to recover, right? Okay, that is the write policy. So let me go to the question.
Okay, so here…
The block is written back when it is replaced, okay? When it is replaced, at that time, the memory data will be updated, so we call it write-back.
Okay? So, with two policies, we can think about which one is better in different ways, okay? First, which one is better for debugging? Okay? Write-through always provides a coherent view between cached data and memory, right? So if you stop in the middle, the cache has the same copy as memory, and memory has the same copy as the cache.
But if you have write-back, then there is an inconsistency between the cached block and the memory block. It's not easy. Okay? And what happens when you have a read miss?
When you have a read miss, again, from the earlier slide we discussed, when you have a miss, you need to go to memory and bring a block, right? Okay, so you brought a new block; think about it. Let's say it is a direct-mapped cache. You need to put that newly brought block into a spot, but there is an existing block there, and there are two ways to handle it, following the two different policies, right? If it is write-through, what can you do?
That existing block is the same as the memory copy, so what do you do? You just overwrite, right? It doesn't produce a write to memory. However, when you have write-back, your existing cached block…
…may be different from the memory block, so you need to copy that cached block to memory. A write-back to memory happens, so the answer is yes, there is a write.
You got it? When you have write-through, because there is always the same copy, you can use that spot and freely overwrite it with the new block. However, with write-back, you need to wait until that block is safely written back to memory; then you can use that spot.
Okay, so let me rephrase the question. It asks: does a read miss produce a write? Instead, let me ask it differently.
When you have a replacement, okay, the kicked-out block: does it require any write to memory?
It depends. So, first, with write-through, what happens? You have a cache, okay? And you check; let's say it's direct-mapped.
With the cache index, you go there; there is something there, okay, but it's not yours. The tag is not matching, so you bring a new block, okay? Then what happens? When you replace, write-through means what?
Just replace. Can you see it? This existing block always has the same copy in memory. Do you agree? With write-through, you always have the same image in memory. So when you bring the new block, on replacement, when you write the new block into its place in the cache, you can just overwrite. Can you see that?
How about write-back?
When you bring the new block, at the place where you need to put it, there is a block already, but say that block has been updated. It is a dirty block; the dirty bit is 1.
Which means this block is different from the memory copy. So what do you need to do? You cannot just overwrite, right? You wait until this dirty block is copied to memory, and then you release the space for the new block. Can you see that?
So the way we handle replacement is different between the two.
Okay.
The last question you need to ask is: what happens with repeated writes?
So you have a block in the cache, okay?
With write-through, every time you are writing to the same block, maybe to different locations within it, since it's write-through, you write through to memory every time, right?
So there will be repeated writes to memory for the same block. However, with write-back, for those multiple writes on the same block, you only record them in the cache; only when the block is kicked out do you write it back to memory. That's the difference.
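The repeated-writes contrast can be counted with a toy model. This is my sketch, not the lecture's; the single-block, everything-hits setup and the function name are assumptions.

```python
def memory_writes(accesses, policy):
    """Count writes that reach memory for one cache block.
    `accesses` is a string like 'WRRWRRRW' (W = write hit, R = read
    hit); the block is assumed evicted at the end."""
    if policy == "write-through":
        return accesses.count("W")      # every write goes through to memory
    if policy == "write-back":
        dirty = "W" in accesses
        return 1 if dirty else 0        # at most one write-back on eviction
    raise ValueError(policy)

seq = "WRRWRRRW"                        # the write, read, read, write... pattern
print(memory_writes(seq, "write-through"))  # 3
print(memory_writes(seq, "write-back"))     # 1
```

For a never-written block, write-back produces zero memory writes on eviction, which is exactly what the dirty bit is for.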
So, here.
Let's say, for the same memory address, when you write down the sequence of operations in terms of memory, you see this kind of thing happening. So it was a write, then read, read, write, read, read, read, write, like that, okay?
So you have another write.
So, with write-through, on each write, what do you do? You update memory, right?
And then, how about this one?
You have a write, and then hit, hit, hit; it stays in the cache, so you don't have any transaction to memory until this block is kicked out, right? You handle everything at this cache level.
Okay, so when you have a uniprocessor, where only one CPU exists, there is no problem with either of them. Can you see that?
In the next chapter, we will deal with cache coherence in multiprocessor systems. These days everything is a multiprocessor, right? And you have threads that share variables between different cores. You need to keep a coherent view in your caches, right?
So, which one makes it easier to maintain a coherent view across different cores, write-through or write-back?
Write-through.
Right?
Whenever you update, you write through to memory, and then what about the copies in other nodes, other CPUs? When you update, the copy that others have is no longer valid, right? You invalidate it.
Okay?
So, we will talk about that.
How about if, let's say, this starts with a read. No, no, it starts with a write.
There are two choices: whether you allocate.
Allocate means, when you have a write miss — let's say this write is the first access, so it's a miss — there are two choices. Allocate means you go to memory and bring that block into the cache.
That is allocate. And then: no-allocate.
You don't have to bring it. Why?
Think about write-through.
Anyway, you need to go to memory, right?
So you just update it there.
Which one works better?
For the same sequence? Yes, for the same sequence, which one?
If there are reads following, you need to allocate, right?
Right? So, okay, with this same sequence of access pattern, if you have write-through, will you allocate or no-allocate? Allocate, because you have reads to the same block.
Okay, that's the reason, okay? But occasionally, if you have only writes, only then, no-allocate. Tell me why no-allocate can be better.
Because you skip the read before writing, so you save cycles?
You are not reading. You are not saving cycles; you are saving something else.
Space. Space! So if you don't cache it…
You have limited space in the cache, right?
So, let's say you know this data is only used for writes, okay? And you already know your write policy is write-through.
Which means every time I write, I need to go to memory anyway. True, right?
So will you use up one cache spot for this block or not? Either way, whether you have it or not, you need to go to memory. Is the way I describe it correct? Right?
That's why, if it's write-only, with write-through, maybe you won't allocate. You will just leave it in memory, because anyway, I need to go to memory. Only on a read miss do you allocate; on that miss, you need to bring the block to the cache, okay?
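The four policy combinations can be compared with a toy traffic counter. This is my sketch, not the course's; the single-block model and the equal cost for a fill, a through-write, and a write-back are my assumptions.

```python
def traffic(ops, write_policy, allocate):
    """Toy single-block model: count memory transactions for a
    sequence of 'R'/'W' ops to the same block. The block starts
    uncached; a fill, a through-write, and an eviction write-back
    each cost one transaction."""
    cached, dirty, mem = False, False, 0
    for op in ops:
        if not cached:
            if op == "R" or (op == "W" and allocate):
                mem += 1               # fill the block from memory
                cached = True
        if op == "W":
            if write_policy == "write-through" or not cached:
                mem += 1               # the write goes to memory
            else:
                dirty = True           # write-back: just mark dirty
    if cached and dirty:
        mem += 1                       # eviction writes the block back
    return mem

ops = "WWWW"                           # write-only pattern
print(traffic(ops, "write-through", allocate=False))  # 4: just the 4 writes
print(traffic(ops, "write-through", allocate=True))   # 5: a useless fill + 4 writes
print(traffic(ops, "write-back", allocate=True))      # 2: one fill + one write-back
```

This matches the lecture's intuition: for a write-only pattern, write-through pairs naturally with no-allocate, while write-back only pays off if you allocate and absorb the repeated writes in the cache.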
So there are different policies on this. And another policy I think you should be aware of, for your Homework 4, is inclusive versus non-inclusive between different levels of the hierarchy, okay? I will leave that up to you, for your homework.
One thing I want you to think about is: what are you going to do with the first write miss?
So, you have a miss, but you know that it's for a store, right? It's not in the cache, right? Of course, you need to go to memory. Think about the write-through policy, right? Write-through…
Is it useful?
Even if you brought that block from memory to the cache, when you write, you also need to write to memory, right?
So, that is the question of whether you want to allocate: whether you want to bring the missed block into the cache when you have a write miss or not. If you bring the block in on a write, it's allocate: you put it in the cache. If not, we call it no-allocate.
So, think about it: you have four different combinations, right? In terms of allocate and no-allocate, you have two choices. Independent of that, you have two different write policies: write-through and write-back.
So, consider write no-allocate with write-through. Even if you have a write miss, instead of bringing that block in, you write to memory there. You won't allocate that block into your cache.
It makes more sense, right?
However, think about why you might want to allocate, why you might want to bring a block in on a write miss, even with write-through. Is there any reason to do that?
Before I answer that question, think about this. You are using the write-back policy, right? Write-back. So, you go to memory for the missed block, and you bring it to the cache.
Right? And then you are writing.
So if you have repetitive writes to the same block, write-back is better, isn't it? And it is better to allocate. When replacement happens, you write back to memory.
In the earlier example, what I asked was: when you have write-through, will you bring the block to the cache or not, allocate or no-allocate? With write-through, you need to go to memory to update on every write anyway, so it makes more sense with no-allocate, right?
So remember: when we talk about write policies, you also need to think about the additional options on allocation.
So, with write-through, on every write, you need to write to memory, right?
A lot of the time, to make your life simple, I assume you have a write-back policy with allocate.
Okay, so then, when you have read and write accesses, the way a read miss and a write miss work is the same, isn't it?
When you have a read miss, you will bring the block in. When you have a write miss, you will bring it in too, right?
And then, write-back: when you have a hit on a read, you only read from the cache, right? There is no memory transaction. And if it is write-back and you have a hit on a write, what do you do? You just write in the cache. It's exactly the same.
So to make a simple scenario where both reads and writes behave the same, oftentimes I describe it as: I have this policy, write-back with allocate. It will be the same as the way we handle read misses, okay? You allocate, then on consecutive hits you only access the cache, and you make changes in the cache.
So writes can be a bottleneck for the CPU, because you don't want to wait until the write-through is done to lower-level memory. So what can you do? You can put in a write buffer, as shown here.
Then the CPU can just put the request in the buffer and go back to the work it has to do.
And the reason we have, not a register, but a buffer (a buffer means a series of registers) is that you may have bursty writes: many, many writes. Because memory is much slower than the CPU, the CPU can generate multiple write requests while the lower-level memory serves one write. So you want a write buffer instead of one register.
And what are you going to do with read-after-write hazards, right? You are writing to memory, and while you are writing, you read the same cache block. So in order to honor the read-after-write hazard, what do we need to do? We need to drain the write buffer first, which means we update memory so that when the new read comes, it will read the updated value.
Okay? That's something we will revisit when we get to multiprocessor systems, okay?
So, read-after-write: okay, if you share one buffer, the buffer is a FIFO, right? So you have a read after a write. As long as you keep that order in the buffer, it's fine. Do you see? Do you agree? It's not a problem.
What about if we separate the read buffer and the write buffer?
First of all, what is the motivation to separate read and write buffers to memory? Do you agree we want a buffer between processor and memory, to accommodate the different speeds of CPU and memory?
Okay, so…
Oftentimes, your memory is, in number of clocks, something like 20 to 100 times slower than the CPU clock cycle time.
So the CPU generates load and store requests, memory read and write requests, rapidly, while only one is being served by memory. Can you see that? You need a queue to hold the outstanding requests.
So what I'm asking is: shall we incorporate two different queues, one for reads and one for writes?
All right: at a traffic light, there is one lane versus two lanes. Which one do you prefer?
Two lanes. And then, in a good design, maybe, if you make a left or right turn, you will be in the turn lane, and if you go straight, you will be in the other lane, right? What do we do? We separate lanes, right?
So, we are having two buffers, one for reads and one for writes.
Which means we are breaking the incoming order, isn't it?
Okay?
What is the motivation for breaking the order?
There's no dependency, right? But then, can a dependency slip through now?
Okay: in order to preserve dependencies, one queue is better, isn't it?
So we are creating challenges by providing two different buffers.
Why do we… okay, go back to what we covered before, right?
Go back to the out-of-order hardware, the speculation problem.
What's the root of the problem?
It can happen in out-of-order execution? Yeah, so we have out-of-order execution. Okay, very good. And then, there: what was the instruction? What kind of instruction do you have at the root of a dependency graph?
When you draw the dependence graph, what is the root instruction?
The load, isn't it?
So look at the examples I gave. You have a first line of code that is a load, you load the data from memory, you do computation, and then at the end, you store the result. Can you see that?
This is the typical shape of a user program.
You bring data from memory, you do some operation on it, and then the computation result is written back to memory, right?
Okay, then tell me: why do you want separate buffers for read and write?
Read means?
Load.
And write means store, right?
Your goal: you're a computer architect, right? You're a designer.
Your number one goal, whatever you do, reliability, energy efficiency, whatever, the one goal you cannot forget in your lifetime:
Performance!
Okay, tell me: why do we want to separate the buffers?
To improve throughput?
Improve throughput, okay. How can we improve throughput by having two separate buffers? So.
I'm giving you a hint, right? This is typical of interviews at hardware companies and, you know, consulting firms. The interviewer knows the answer. And they want you to be successful, so they give hints while you're talking. So don't dwell on your own thoughts only. Listen, listen, listen. So why did I give that information?
With one queue, you just serve requests as they arrive, in order. So separating means breaking the order; that's the hint, right? Whatever is in the queue, load, store, load, store, with one queue I just blindly serve it as is.
Versus: I separate the buffers, which means I can break the order, okay? Then, when we break the order, which one do I want to give higher priority? I gave a hint on what? The dependency graph.
Dependency graph.
When you have, you know, dependency, your load is always the root of the
dependency, means if your load handled quick, your overall execution time will be
reduced, isn't it?
Until the data is arrived in the CPU site, you can't do anything. You just stole
there. Whatever you have 1,000 functional units there, 1,000 registers, you can't
do anything if the data has not arrived, right?
So, we always give a higher priority to
read, yeah. This is a… you think it's common sense, right?
But until 2000… Three, people didn't think about it. We had the Womb effect.
Okay, and in 2004, I met, Intervi, Pentium 4 designer, and they proposed to have
two separate buffers. And then I… I listened to the idea, and then we didn't have a
separate buffer. Yeah, we had only one buffer.
Okay.
So, if you know the fact very detailed, okay, tables in the detail, you can…
propose this is not a difficult idea, isn't it? We know load should be
served first, if you have a store. Store is always the end of a dependency. It is
not… well, then you have another tree, and that load
load for the… just before stored. Then it is, right?
read after I hazard, right? That you can't wait, right? Then another time, do you
really need to get data from memory if that stored data exists in the CPU?
We can use it. That's the way we handle it, okay? We look at the buffer. So with this
idea, they just said, oh, we will serve it as is, and
wait until draining happens. Draining means the buffer is updating the memory, and then you
really read. As long as you don't care about performance and only
care about correctness, you can do that. But as computer architects, we care about
performance.
Then what do you do? You can examine the buffer for the same address, the
aliasing someone asked about, right? I promised we will talk about it later. Yes.
We need to compare the addresses, and if they are the same, we can fetch the data from
the buffer instead of going to memory. There are a lot of things going on in
the memory side.
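A minimal sketch of that buffer check, assuming a simple single-core model with a dict-backed memory (all names here are illustrative, not from any real design): a load first searches the pending stores, youngest to oldest, and forwards matching data instead of waiting for the buffer to drain.

```python
# Hypothetical sketch, not any real design: a store buffer with
# store-to-load forwarding over a dict-backed memory.
class StoreBuffer:
    def __init__(self, memory):
        self.memory = memory     # backing memory: addr -> data
        self.entries = []        # pending stores, oldest first

    def store(self, addr, data):
        # Stores are buffered instead of updating memory immediately.
        self.entries.append((addr, data))

    def load(self, addr):
        # Search youngest to oldest: the most recent store to the same
        # address supplies the data, so the load need not wait for draining.
        for a, d in reversed(self.entries):
            if a == addr:
                return d                     # forwarded from the buffer
        return self.memory.get(addr, 0)      # no alias: read memory

    def drain_one(self):
        # Retire the oldest buffered store into memory, in program order.
        a, d = self.entries.pop(0)
        self.memory[a] = d
```

For example, after `sb.store(0x100, 42)`, a `sb.load(0x100)` returns 42 straight from the buffer, while a load to an unaliased address falls through to memory.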
Okay.
When you said that earlier, it was only one buffer, right? Yes. Yeah, so on the
CPU side, we do have aggressive out-of-order execution, but in memory, we
had only one buffer, which means in order, right? Right?
Okay, that prevents a lot of problems. As soon as we separate it into two buffers,
then, you know, the aliasing problem comes, and how to serve, you know, read-after-write;
but still, we want to do it
to improve performance.
Okay. For me, it's not that long ago. I was very shocked when I heard that
idea. We had only one buffer? How come? Right?
But then, these kinds of things are happening.
Audio shared by Kim, Eun J
But even with a single buffer, between out-of-order cores, when you have a read-after-write
hazard on a memory access, you need to make sure the write updates first, and then you read.
Okay, so let's do the quiz.
This time, I need the volunteer.
Oh, anyone submit?
Can I use your solution?
Anyone? Raise your hand. Nobody, you're busy with the midterm preparation, right?
I'm excited!
Oh, you do well!
I ran into you in the, you know, hallway, and you said, I'm studying for 614.
Your advisor will be upset if you study for 614 all the time. I had that
complaint before:
my students only work for 614!
Huh?
18, 18, 18. 18.
Okay.
Anyone? Oh, we have a two-way example, very good.
Wait, anyone?
Challenge? I'm tired. Anyone come, do it. I can tutor you.
Okay, I need a new person. Anyone didn't volunteer before. You didn't?
At all? Okay, come.
Okay, so with this, what do you need to do? First step.
All these addresses are hexadecimal, so convert them to binary. Let's do the first four in
class together.
Okay, so use this space.
Let me, annotate, and then color blue.
Well, can you… okay?
Convert to binary.
Good touch.
Only 3?
He… you do maybe up to two, and then we have another volunteer. I want to make sure
you all know this.
Indeed.
Yeah, you can clear everything.
Okay.
Whatever.
Hallelujah.
Even if I'm not familiar with the English.
One more time? Okay.
Oh my goodness.
So anyone volunteer to partition the field?
Yeah, so you do first two.
like, this preamble thing. It's the same. It's 27 bits for the tag, and so on.
Okay, let's do two first, and then partitioning.
Yeah, we need to look at the block offset first, right? How many bits do you need
to have as a block offset?
Two? One block is one word.
Oh, because it's still 32 items total. So you do draw one big rectangle? Yeah.
So, first of all, look at the block offset. One block is one word. One word
is 4 bytes, so how many bits?
Yes, yes, same size, okay. No, no. Let's do the easy one first… that other one is the index
field. We want to do the block offset first. What is the block size? One word, 4 bytes? How
many bits? Two bits, so it will be the last two, actually.
Same formula as it was before. Okay, so let me do this.
Okay?
Group this address, and then these two bits are the offset. Okay, then for this kind of
question, if you draw the cache configuration, like a table form, it would be easy,
right?
So how many tables do I expect when it says two-way?
Two tables. Two-way, two tables, okay, two tables. And the total number of blocks you
should see in both tables together.
8 blocks. Why? Your total cache size is 32 bytes, and your block size is 4 bytes, so you
divide 32 by 4, and you have 8 blocks, right? And then 4 blocks go in one
table, but you need to construct two tables, right? So you cut it in half.
So, 2 times 4 is 8.
Do it together; just submit while you're working on it. 8, I think… 32 bytes, and
one block is 4 bytes, so you divide 32 by 4.
Because you're… oh, you can think of it like, I have 32 students.
And then one team, one block, is 4 people. Then how many teams do I have?
32 divided by 4, right? Yeah, 8, that's the way.
So, the total cache size is 32 bytes, and what is your block size?
The block size is one word, so you have 32 people, and one block, one team,
is, let's say, one word, four people. How many teams do you have? Eight. Right.
So you get 8. So, to specify 8 different things, how many bits?
3. If it is directly mapped, yeah. So, see here: if it is directly
mapped, you have 8 rows in one table and 3 index bits, but since it says two-way, you cut it in half.
So you are having two cache tables, side by side.
Each one has 4 rows, because the total is 8 and you cut it in half.
Put one here, one here. Okay? Then, how many bits do you need to
identify one of the four rows? Two. Take out those… so those two bits are the cache index, and the
rest is the tag. Okay?
All right. Well, we can start from…
Here. Thank you. Maybe I will have another volunteer.
If you're not cleared, come.
Alright, so… Okay.
Nope.
No. Okay.
Can I have your attention? This will be the last time I explain how N-way works,
okay? So, with the block size, you come up with the offset:
there are 4 bytes, so 2 bits, right? And then what you need to do is:
total cache size is 32, and you divide by the block size. Why? You want to know how
many blocks, because the block is the minimum unit we are handling in the cache. How
many?
32, and each block is 4. So how many blocks can you have?
8, right? But if it is directly mapped, that's 3 bits.
Right? You have 8 different rows, and then you would have 3 bits. However, it
says
two-way, which means you have…
Very good. So directly mapped, you're having one table. Two-way
means you cut it in half.
The first half of the table goes here, the other goes here. Can you see that?
So then, two-way means that with the same index, you have flexibility: a block can go either here or
here.
Okay? So, how many rows do you see here?
4. That's why it's 2 bits.
It's not coming from the "two" in two-way, okay?
If your total number of cache blocks is 16, then two-way means 8 sets, and then it would be 3 bits,
okay? So that's it.
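The whole partitioning above can be checked in a few lines. As a hypothetical reading of the example, assume byte addresses like 0x4419 (so tag 441, index 10, offset 01) with the 32-byte, two-way cache of 4-byte blocks:

```python
# Split an address into (tag, index, offset) for a set-associative cache.
# Defaults match the class example: 32-byte cache, 4-byte blocks, two-way,
# so 8 blocks total, 4 sets, 2 offset bits, 2 index bits.
def partition(addr, cache_bytes=32, block_bytes=4, ways=2):
    num_blocks = cache_bytes // block_bytes       # 32 / 4 = 8 blocks
    num_sets = num_blocks // ways                 # 8 / 2 = 4 sets (rows)
    offset_bits = (block_bytes - 1).bit_length()  # 2 bits
    index_bits = (num_sets - 1).bit_length()      # 2 bits
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset
```

So `partition(0x4419)` gives tag 0x441, index 0b10, offset 0b01: the last hex digit 9 is 1001, its middle two bits 10 pick the row, and everything above them is the tag.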
So, you can do this one first.
That's the same thing as what I said. Or two, I think you can put it… Oh, you put
the rest.
Oh, you could… oh, so this continues. Yeah, yeah, yeah.
For engineering?
And then how about the second one? You're looking for something? The second one,
you have index 1-0, right?
So it is empty, so you want to use this… Yeah, it's the same…
Any of the ways is okay. No, no, you put the index, 1 0.
1 0, and compare, because within a set the block can go in either way. And B means
1011, right?
1 0 1 1 0, so there are two tags to check. What is your tag?
441? Is it there? Yes. Okay, let's do it. Next one, more marker.
Cool.
Yeah, okay, get familiar with the binary, okay? 0. Okay, so what is the index?
And then the tag, four, four… No, no, no, no. 4, no, two, two, okay. So, here.
Alright, thank you.
Anyone? You want to do more, or can I stop here? Can you do the rest? Do you
finish?
No?
You think you will do it later, right? But before the final, you have
so many things, the term project and other things, and you still won't know even
this basic thing.
Just do it now! Don't leave the room if you don't know this. I'm telling my child
the same thing. Don't leave the room.
So, you brought this…
Why don't you use this space to have another block? But then, in hardware, will you
swap things around? Why?
Okay, very good question. Let's look at this second access.
Can I have your attention? Second access. So, with the first access,
you had tag 442 at index 1 0, so you leave the data here, with 442, okay? The tag is recorded with the data, okay?
The next time, you have tag 441 with the same index 1 0. You come here.
Only one way is valid, the other is invalid, so you examine the valid one
and match it against your tag. Your tag is 441. Unfortunately, it's a miss, okay? So you
go to memory and bring the block.
Now, do you understand? If the tag is not matching, it's not your block, okay?
It's a different block.
So you bring your own block, and where are you putting it? Here, because it's
empty, right? You use the other way's space.
Yes, it's the age.
So let's do the third one, okay? You have this address, and the
index is 1 0.
Okay, I prefer not to write the address down here. Why? This shows better
that you are sharing the same index across two tables.
So, with one index, you go there, and there are two
cache blocks available, and you do tag matching with both.
And then this one is matching, so you use the data here. We didn't write down
the data; it's just a table, but there is data there, actually.
Okay, now we need a replacement case. We will have a replacement case. Because
with the first load and the second one, you had empty
space, so you could bring the block here.
But with this, let me change the question, okay? Say the fourth access is 4439.
Then 0x4439: the last hex digit 9 is 1001, so you have index 1 0. You go to this row, okay? What are the tags?
Your tag is 443, and the stored ones are 441 and 442. None of them match.
Then you need to replace. Now, which one will you kick out? We talked about
replacement policy, right? Which one? You look at the earlier history:
among these two, which one was the most recently used?
442, because it was just a hit here, in the third access.
What about the earlier access, 442?
It was brought in first, but then it was referenced one more time,
very recently, so it won't be kicked out; it is the most recently used. Okay, so then 441 will
be kicked out, replaced by 443. Okay?
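To replay the four accesses just described, here is a toy two-way, LRU cache. The addresses 0x4429, 0x4419, 0x4429, 0x4439 are a hypothetical reconstruction of the example's tags 442, 441, 442, 443, all sharing index 10:

```python
# Toy two-way set-associative cache with LRU replacement.
# Geometry matches the class example: 4 sets, 4-byte blocks.
class TwoWayCache:
    def __init__(self, num_sets=4, block_bytes=4):
        self.num_sets = num_sets
        self.offset_bits = (block_bytes - 1).bit_length()
        self.index_bits = (num_sets - 1).bit_length()
        # Each set holds up to two tags, most-recently-used first.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, addr):
        index = (addr >> self.offset_bits) & (self.num_sets - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        ways = self.sets[index]
        if tag in ways:            # hit: move the tag to the MRU position
            ways.remove(tag)
            ways.insert(0, tag)
            return "hit"
        if len(ways) == 2:         # set full: evict the LRU way (last)
            ways.pop()
        ways.insert(0, tag)        # fill; the new block becomes MRU
        return "miss"
```

Running the four accesses gives miss, miss, hit, miss, and after the last miss the set holds tags 443 and 442: the 441 block was the least recently used, so it was the one evicted.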
Right? Did we do this in parallel? Each access happened before the other. Well,
like, no, like, setting the first paper check, the second…
Are you clear about this? Are you done with it? You should, you should. Think about
it.
Obviously, there's not going to be replacement in the multimed.
So there's no way for there to be a replacement.
This one's already here.
I mean, like, the same… You're done with that?
How many hits do you have? No, why? When you miss, a miss means you bring the block in,
right? You need to have space.
Then you kick one out, right? And what she wants us to do
is put in the whole tag.
Yeah. For how many? Anyone done? In reality, there's not 5? I didn't bring hope…
This is 1, 2, 3.
Okay. You were excited to met first.
And then, when it says a two-way, you cut it half, put two tables in parallel like
this. Can you do that? This will make your life much easier. Why? With the same
index, you have two rows, right?
We are all fine, right? Any questions?
Same question. So what's the question? So, with one index, do you agree
there are two items to match?
So, in hardware, you do parallel matching. Let's say you are
at the third access; your tag is 441, and your index is 1 0.
You are examining: is there a 441 in this row? Here, it's a hit. Hit means it's
the same block you have already. You're confused because
the earlier access was also in block 441, let's say 441B, with a different offset.
They differ only in the offset. They are the same team, same block, but a different
offset, and the data is there. See, it's not an easy concept: you
have a block, and then words within it.
I didn't bring… Oh, so this is a miss?
I think so. So, from first, we have, 1, 2, 3, 4…
4 hits? Yeah, those are the hits. Okay. Let's move on. So, someone here just
looked at the
first three hexadecimal digits, and then you said, oh, they all hit. No, because
your index is actually in the last hexadecimal digit. Look at that: E
is 1110.
F is 1111. They go to the last rows, not always to the same row. Okay. So let me…
You're telling Irish style, you're telling me the football.
Pretty much all taken at once.
So, we are moving on. Your homework will be released, okay, right after
your midterm; it's kind of fast, okay? You will have the homework released, and then
you should start immediately, because it requires some coding.
I found that for some of you, it's a challenge, okay?
You need a head start.
And then you need to think about your term project at the same time.
We will move forward as soon as possible. We will release it right after the meeting. The
homework is individual; the term project is a different thing.
With this set of slides, let me give you a recap on memory hierarchy basic design.
So when a word is not found in the cache, we call it a miss.
Then we fetch the word from a lower level in the hierarchy, which incurs a higher
reference latency.
The lower level may be another cache, or the main memory.
Also, when we fetch the word, the other words contained within the same block
will be brought in together, so that we can take advantage of spatial locality.
When we bring a block from a lower level in the memory hierarchy, we can
place the block into the cache in any location within its set, if it is set
associative.
If it is fully associative, you can place it anywhere.
How do we determine where to place it? From the cache index,
which is the block address modulo the number of sets in the cache. We
call that the cache index.
So in terms of placement, we learned three different schemes. One, the directly mapped
cache, which means only one block per set; you have one designated spot, indicated
by the cache index.
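That modulo rule is worth one line of code; the numbers below are hypothetical, not from the slides:

```python
# Cache index = block address mod number of sets. With a power-of-two
# number of sets this is just the low bits of the block address.
def cache_index(addr, block_bytes, num_sets):
    block_address = addr // block_bytes
    return block_address % num_sets

# Directly mapped, 4 sets, 4-byte blocks: addresses 16 bytes apart
# map to the same set.
assert cache_index(0x10, 4, 4) == cache_index(0x20, 4, 4) == 0
```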
The extreme case we discussed
is fully associative. We have only one set in the cache, so you are going to use the
whole rest of the memory address as a tag.
Okay.
I think you missed this part.
When we arrange the cache as fully associative,
do we need a cache index?
No, right?
So, your index field becomes zero bits. You don't need an index. So, see, when we draw
the partitioning,
you have the offset, and then index and tag, right?
Think about it. When you go to two-way, one bit is gone;
your index field shrinks. Can you imagine? If it was originally 16 sets,
then it becomes 8, right, with two-way. If it is four-way, it will be 4; another bit is gone. Those
bits change to tag field bits.
So if it is fully associative, you don't need any index field; everything
becomes tag, which means you need to search the whole cache space
to figure out which block I'm looking for, okay? There is no designated spot.
Okay?
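The shrinking index field just described (16 sets, then 8, then 4, then none) can be tabulated directly; the helper name here is mine, not from the lecture:

```python
# Index bits for a cache with a fixed total number of blocks:
# num_sets = total_blocks / ways, and the index selects one set.
def index_bits(total_blocks, ways):
    num_sets = total_blocks // ways
    return (num_sets - 1).bit_length()

for ways in (1, 2, 4, 16):
    print(f"{ways}-way: {index_bits(16, ways)} index bits")
# 1-way: 4, 2-way: 3, 4-way: 2, 16-way (fully associative): 0
```

Every bit that leaves the index moves into the tag, until the fully associative case where the tag is the entire block address.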
All right, so I will stop here. Good luck with your meeting.
Fair?
Did you have to simulate that? Yeah, of course, let me see.
Yeah, that's so much better as a teaching tool.
See, this is how they should be teaching this stuff. Can people do this? Yeah.
So, Monday, unfortunately, okay.
I cannot have a special office hour, okay? Monday, you will have review
sessions, and I already gave them some questions, so
those will give you a lot of hints. So, attend the review session, okay? Thank you.
Wait, so she's saying she won't have extra office hours, but on Monday there are review sessions.
Oct 27:
Beautiful.
I'm sitting in your senseless, but… Good afternoon.
So it's time. So, as announced, Homework 4 and the term project details are released,
so I will take some time to discuss your term project.
There are two options, if you see, okay?
But actually.
I feel like they are related. Why? If you start Homework 4, you can see you are
using ZSIM, a very detailed simulator, okay, and working on the
RRIP paper, an ISCA paper. It used to be one of the top papers
students chose for term projects around that year.
then I really liked the, you know, the topic, and then I gave this as a homework,
okay? Then your term project, if you find any successor papers, actually.
This year, last week, while you were taking the midterm, I was at Micro in
Seoul, okay? And if you look at the main program,
and if you search for cache,
there are a lot of cache papers, okay?
So, let's go through the, overview first. So, there are two options. You can choose
any paper from last three years.
mainly ISCA, HPCA, Micro, ASPLOS, those top venues, okay? Some of you ask, how about
DATE, or other systems papers, or OSDI? Maybe you can, but this is an architecture
class, right? So I…
I think you want to choose a paper from top conferences, okay? Maybe, alright, at
the end, you may not get highest mark on this top project, but in your resume,
think about it, in your resume, you will put the title, what paper you choose.
Don't choose a paper which won't make you proud of work, okay?
So, these are top, top…
places in the computer architecture. The thing you read and study about it, you
should be proud of, okay?
This is what leads you in the top place. When you go to job interview or any other
place, this define where you are. Can you see that?
Okay? Don't compromise the level of work you want to do for some projects. I know
Once you get meat on paper.
Okay, so… okay, we will talk about it later, okay?
But the thing, so we… we gave a separate score for Proposal 2, right?
Research, a lot of times, what you choose is very important, okay?
I'll be a bit disappointed if you choose a paper which is very short, like 5 pages, a very
simple, small extension of RRIP. Yes, I…
Because it's in the syllabus in written form: a recent work, related
to cache replacement policy, and you can do it, right?
This option 2, actually, we also did last semester, so you might be able to get code from
other people. Remember, we know; we have all the old code, okay?
The thing is…
If you choose option 2, the most recent replacement policy: there are papers from this
year's ISCA, and this year's Micro, which I just attended. You should choose papers from there. Can
you see that?
You should choose a paper from this year's ISCA or this year's Micro. Imagine you
have an interview with NVIDIA or AMD, and you talk about this year's Micro paper, okay, for
your internship or your job.
They will be impressed. Do you know what I mean? Oh!
EJ doesn't teach, like, all the RRIP, you know, RRIP policy variants…
Do you know what I ended up talking about?
So, why is coming off?
So, for me, it seems like there is no difference, but for those who already started
your research in architecture, some of our master's students approaching me and
whatever, then maybe you won't work on replacement policy, right? I allow you to
choose any other topic, but if you choose option one, I want to know what
kind of work you are proposing, like a Paul grad student or a Daniels student. You
already work on a computer architecture problem, and you want to extend your work
for this class, but I need to know what the
You know, portion you will be using for classwork, okay?
Okay.
And then option 2, yes, any replacement policy, but if you see here, like, a cash
replacement policy, right? You wanna…
You want to choose this paper. Can you see that? This is the most recent paper on
replacement policy. I don't mind all of you choose the same paper, okay? But as you
can see, there are several more.
Can you see that? There is very interesting cache replacement work
presented this year, just last week. So if you choose these,
people will be interested. What if you can beat that one, right?
In my history, I never had that, okay? But you can change the history, all right?
You are smart, and then, how many of you can be in a team?
That's it? 3, right? Okay, so 3 of you, you can do that.
So, the…
For option 2, we have a very specific way to evaluate, because we will see the
replacement policy results. But I'd rather you have a lower hit rate with the most recent work,
can you see that, and then a good analysis:
how can this paper be in a top conference even if its
hit rate is lower than RRIP's?
There should be a reason, right?
That's good research you can do.
Any question?
Okay, so to do your term project well, you need to finish your homework first.
Okay, don't delay finishing up Homework 4. It shouldn't be that difficult if you
start from now on, okay? Don't wait until the deadline. Target to finish in a week.
Okay, day and night. You know, at Micro, this is our community, okay? If you
don't like this culture, don't take computer architecture as your field.
Okay, in the conference session, people present, we listen. At the same time, we
work on the rebuttal, because we had a deadline.
And I couldn't go to the banquet because my students were working on the rebuttal. I wanted to help,
okay? This is the field. We are moving so quickly, and if I, you know, just enjoy my time,
I may have a rejection, and I need to wait one more year, right? This won't
work.
Okay.
Exciting?
So you can experience some taste of research with this project. So, I hope you
challenge yourself, okay?
I know that… what is this puzzle effect? So let me change the…
Perry. Perry should be somewhere here.
And we can see that I raised here.
Let me use this. Morgan. So, I can take your question.
Any question on Tom Project?
So we need to upload things, but I did not find any link in the document. Yeah,
yeah.
I will ask the TA to post the link on Piazza. I didn't know; it should be there. Okay,
and also, for this term project, I try to give credit for your effort.
So, I look at commit times.
Okay, so you should start early on, and you shouldn't have all your commits at the end,
in the last two days. Do you know what I mean?
So, from now on, you need to work very hard, finishing Homework 4 as soon
as possible, and start this project. And,
since we allow three of you to work together, there are some teams
that argue about contribution. So, we will ask you to report a
peer evaluation of each other, okay? This is the same way any computer architecture
company does it. You will be evaluated by your peers, okay? So you should work hard, and
you should be, how can I say,
prompt with any, you know, discussions going on, and you should be involved early on.
In the real world, I don't think there's ever a perfectly fair split. Let's say two people work
together, 50-50 contribution. There is no such thing, okay?
So, if you think you have 50% of a contribution, maybe in reality, you will be 30%.
People tend to think their own contribution more, right? So, if you think, oh, I
contribute equally, then 30, and
So, what I'm trying to say, yeah, sometimes we are supposed to teach or talk about
group work. This computer architecture field, wherever you go, you will work
together, right? So, let's say I contribute 51%, the other 49%, and then
In my mind, deep down, I have a complaint, right? Why doesn't he or she reply to my
email, why doesn't she come when I
cannot sleep, while she has a nice dinner, right? Something like
that. It's always there, because in the world, there is no 50-50.
I usually say this thing for my undergraduate class. When you make a team.
Either, if there are two, either someone knows better than you, or you know better
than the other one.
Then, my kids actually complain about this teamwork.
the other one never does any work, and I'm the only one doing everything. And I ask: you
are the one who cares, right?
After all, in the end, it is your grade, right?
So, I told him, you should work hard, but…
In that position, this is a very unique opportunity that you can learn how to be a
leader, okay?
A leader, if you know a little bit better, or if you care more than the other
person, you immediately figure out what can be done by the other person, the other
people. You need to manage the team. That's the leader role, right?
It's a very unique chance. So, when I… actually, I complained a lot when I had to
work in a team, like, in my undergraduate.
My school, among 550
students, had only 22 women, and in that digital logic class of over 100 students, there
were 4 female students, right? So, very skewed.
So, it's hard
to be, how can I say, respected by the other teammates, okay? So I had to fight. And there
was one time, in software engineering, there were four or five of us working together.
That guy complained all the time.
All the time complaining. I think he cared more than us.
Of course. But he doesn't know how to lead a team.
Every time he complained.
Okay, and so I didn't want to work for that project.
Because every time we meet, he complains.
Okay, what can I do?
So, I have some Peter experience as a team, so I understand that we have… you have
some, you know, bad experience, but two things.
If you are better than others, you try to be a leader in a team, okay? You are also
responsible to lead a team.
If you are on the other side… yeah, I was lucky sometimes; my teammate was much better.
Like, I remember for database, we were supposed to implement a B+ tree, and
actually, the guy who teamed up with me was a genius.
He came to college when he was 15, while others were 18, and I nicknamed him
Walking Debugger, instead of Walking Dictionary. He glanced at the code and knew
what was wrong. He's so smart. So, see, I didn't have to work, right?
But I sat by him every time I could, and I asked for a small part that could be done by
me, okay? So when the TA evaluated our team, he asked questions to me, too. I
understood everything.
Although he led, I knew the details. And those details were on the exam.
I aced it, okay? So, don't just exploit it if your teammate cares more than you, okay? At
least bring water, sit by your partner, work together, because it's your time to
learn from him.
Okay?
Either way, you will be happy either way, right? There is no 50-50 in the world.
Okay.
All right, can I go on with the class? Any questions on the term project?
Homework?
Good.
So, don't try to get code from an earlier class, okay? Because I expect you to choose
the most recent paper on cache if you choose option 2. Can you see that?
Okay? We have good papers. If you look at ISCA, I-S-C-A, 2025, it was in Tokyo, you can
also find cache replacement policy papers, okay? There are some in GPU, but
well, you can try those in a CPU environment with the SPEC benchmarks, too,
and you can report how it differs from the, you know, machine learning
applications. That's a good study, too, okay?
Any other questions?
All right, let's move.
Well, I have a question. Yes?
For Homework 4, in the paper, there are two different options for the promotion policy,
frequency priority (FP) and hit priority. Which one should we implement in our code?
I don't know the detail; you should read the homework.
Anyone have a better idea about this? There are two policies. I didn't read the
homework description, so I will answer on Piazza. Okay, maybe you can put that
question on Piazza, and we will make it clear which
one of them. And, like the midterm, on the final I will use the RRIP paper, okay, to make a
final exam question, okay? So you need to know how the promotion policies work, and how those
counters work. Study well
on the example that the ISCA paper gave, okay?
All right.
Excellent.
Nope.
See you then.
Deadlines, okay, Pat.
Basically, you're so much.
Pick the phone.
So you need to make your own ketam, right?
Yeah, that's good.
So, we are in the
memory hierarchy. We are done with Appendix B, and then we started Chapter 2, and
I think we finished the overview. Are we done
with quiz 19, or not? Okay, so quiz 19.
So, a week is too long, I forgot everything.
Yeah, I had such a great time. If you go there… maybe it's on the
web, not here; there are pictures people posted.
But there were so many people attending, so it was crowded.
Yeah, our field is growing so quickly.
Alright.
So, for this question, what you need to remember is
average memory access time. Do you recall that formula?
Okay.
Do it. Average memory access time.
Okay, let me read it for you.
Consider there is an L1, and the access time is 1 nanosecond. It means the hit time
is 1 nanosecond. When you
have the data, it will take 1 nanosecond, and within 1 nanosecond, you will figure
out whether it's a miss or a hit, okay? And then the hit rate is given, 95%, and there is a
changed cache design, so the hit rate improved to
97%, but the access time increased to 1.5 nanoseconds. This is a very
practical example, where, a lot of times, we think about how we can improve the hit
rate.
Do you recall what a cache is? Maybe you were so busy with the midterm, you forgot
everything. What is a cache?
Please. Yeah, when you interview with NVIDIA, AMD,
Google, they ask: what is a cache?
It's a smaller region of memory with a lower latency than regular memory. Okay, so
you have a smaller space of memory
than the actual memory, and then you use it for locality, right? Okay: here, a smaller,
lower-latency space.
How can we improve a hit rate?
We were going to talk about those, right, before that, but can you use your common
sense?
Increase the size. Increased size, bigger size. Okay, and then can you explain why it
would take a longer time, 1.5 nanoseconds?
Yeah, so you have a bigger space, means… do you remember the
partitioning?
The block size we didn't change, right? So the block offset field is fixed, but what
changed? With the same organization, let's say directly mapped, if you double the
space, you have one more index bit, right?
And a cache, a directly mapped cache, is actually a linear table. A linear table means the
search is implicitly linear: let's say your index is 10000, then you start
from 00000 and you go down. It's not random, okay?
So it takes time. So when you have a bigger cache, it takes a longer time. Can you
see that? All right, with this…
What conditions must be met for this change to result in improved performance? Tell
me.
There are other parameters not discussed at all, so you can use only
the average memory access time formula. What is the average memory access time? It's the
average time to
access memory.
Hit time plus… hit time plus miss rate, multiplied by
miss penalty. So, the miss penalty has not been given here, right? And the miss
penalty has not been changed between the two designs, so we assume it is the same, right? Okay, so tell
me, what is the condition?
In your equation, only the miss penalty is an unknown variable, right?
One minute.
I did the quiz, and I found out that the penalty of accessing memory should be
greater than 25 nanoseconds for this change to have a positive impact.
So, penalty should be, like, that was a condition that I got.
Go ahead. Everyone hearing?
You can, you can share your work here.
But this is the easy stone, right?
You got it?
No?
Okay.
So the change is beneficial only if the miss penalty is greater than 25 nanoseconds. How do you do it?
You write the two average memory access times, and the changed one should be
less than the original one, okay?
Here? So you can use… And here.
Maybe this… Wait a second, okay.
So you can write down here.
I can just… Okay, whatever. I don't know. Don't ask me.
Okay, you can just write down here.
Alright, so… So, everybody got it, right?
Don't just sit there with empty hands. You know yourself, your midterm wasn't easy,
right? Why? Because you didn't practice during class.
It requires some time. Your final will be better. I'm sure you will have enough
time for the
final. I'm sorry for the midterm; we had to squeeze it into a small time slot, and I
was told you ran out of time.
Hmm, you're okay?
USC? Okay. Right.
So just to leave… Okay, the other one, new one.
Oh, whatever.
That's the only one. Yeah. Interesting. I like this way of thinking. Do you put
this way, or did you put 1 plus this?
Do you see? If you have a different way of writing, tell me.
Right. Do you agree? Yours and his same?
Okay.
Okay, because this… he separates the hit case and the miss case, and the hit case only takes
1 nanosecond. The miss case: 1 plus the penalty.
Right? And similarly, the other way a lot of people do it: always count a hit. You
search your cache, and then
if it is a miss, you add the additional time.
So that's what you can do, okay? Thank you.
So, the old one should be bigger than the new one, and then you can get the
formula, the answer with the penalty. What I was asking is this one. Let's say
here, you always pay the hit time, and then if it is a miss, you add the miss
penalty, okay? Actually, these two formulations are the same.
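The two formulations can be checked quickly. A minimal Python sketch, with made-up numbers (1 ns hit time, 2% miss rate, 25 ns penalty), showing that "always pay the hit time plus the miss-rate-weighted penalty" and "separate hit and miss cases" give the same AMAT:

```python
# Two equivalent ways to write average memory access time (AMAT).
# The numbers used below are hypothetical, not from the quiz.

def amat_always_hit(hit_time, miss_rate, miss_penalty):
    # "Always search the cache": every access pays the hit time,
    # and a miss adds the penalty on top.
    return hit_time + miss_rate * miss_penalty

def amat_split_cases(hit_time, miss_rate, miss_penalty):
    # "Separate hit case and miss case": hits take hit_time,
    # misses take hit_time + miss_penalty.
    return (1 - miss_rate) * hit_time + miss_rate * (hit_time + miss_penalty)

print(amat_always_hit(1.0, 0.02, 25.0))   # 1.5
print(amat_split_cases(1.0, 0.02, 25.0))  # 1.5
```

Expanding the second form shows why they match: the hit time appears in both branches, so it factors out, leaving hit time plus miss rate times penalty.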
Alright.
Okay, we will move on, and then you… we… we discuss about this aspect, so… you can
answer. One…
Okay.
Alright.
We already know the basic six, but let me… What's happening?
So, you picked this. Do you all agree? Bigger block size?
Okay, let's go.
Audio shared by Kim, Eun J
With this set of slides.
I'm going to talk about 6 basic cache optimization techniques, which are very
important to understand first before learning other advanced topics.
In this chapter, to reduce the average memory access time, we're gonna learn 10
different advanced cache optimization techniques. Before I cover those, let me
introduce, recap, the 6 basic cache optimizations we can think of easily. First, we
can increase the block size: a bigger block
may reduce compulsory misses, because with fewer first-reference misses you can fill
the whole cache area. However, it will increase capacity
or conflict misses, because you have a smaller number of distinct blocks.
And because the block is big, when you miss, it will take a longer time to bring it from the
lower-level memory hierarchy, so it will increase the miss penalty.
Next, we can simply make the cache bigger, larger, to reduce the miss rate.
However, if you have a huge cache, in the extreme case like memory, it will take a longer
time to access, so it will increase the hit time. So, basically, when
you design a first-level cache, you really need to make sure the hit time is
matched with the clock cycle time of your processor.
Along with increasing the hit time,
yes, it will increase power consumption, too.
Another way of thinking: we can provide higher associativity to reduce conflict
misses. It provides flexibility in terms of block placement. However, we need to go
through tag matching, which increases hit time. Of course, it will increase power
consumption: storing tags, and you need to have multiple tag-matching logics.
Another way to use the abundant cache area available with modern process technology is a
higher number of cache levels. You can reduce the overall memory access time. Instead
of going to memory directly after missing the first level, you can go to the second. If it
is a miss there, you can go to the third, like that.
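To see why an extra level helps, here is a small sketch of the two-level AMAT formula; all latencies and miss rates below are invented for illustration, not numbers from the lecture:

```python
# Sketch of how an added cache level changes AMAT (hypothetical numbers).
def amat_two_level(ht1, mr1, ht2, mr2, mem_time):
    # A miss in L1 goes to L2; only an L2 miss goes all the way to memory.
    return ht1 + mr1 * (ht2 + mr2 * mem_time)

one_level = 1 + 0.05 * 100                      # L1 only: every miss pays memory latency
two_level = amat_two_level(1, 0.05, 10, 0.5, 100)
print(one_level, two_level)  # 6.0 4.0
```

Even though the assumed L2 catches only half of the L1 misses, it cuts the average access time, because most misses now stop at a 10 ns structure instead of 100 ns memory.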
And we can give a higher priority to read misses over writes, because a read miss is
usually the cause of a CPU stall due to data dependency. So in the main memory controller,
you can separate the read queue and the write queue, and then you can always serve
read misses first. However, when we separate those queues, we
destroy the FIFO order, so we need to have an additional mechanism to make sure we
serve read-after-write hazards correctly.
The last technique is
about virtual memory. Because we have a fixed-size physical memory, we use virtual
memory, a virtual address space, so the user will use virtual addresses. You need
to go through translation: from the virtual address, you need to get the physical
address to
access the cache. So, oftentimes at level 1 they use the virtual address to avoid this
translation time delay, which is on the critical path, or we can do slightly
different tricks to reduce or avoid the address translation time.
I will cover this when I talk about virtual memory, one more time.
Okay, so those are the six basics we discussed.
Let me share a recent trend in memory hierarchy design.
It seems like computation is everywhere. I saw a couple of papers where even cache-
side SRAM is coupled with computation.
So, instead of bringing those operations to the CPU, when you have a cache, the
computation is done on the cache side. So, earlier, we had some…
The cache itself, in a
systolic-array way: you have a multiplication where you don't even need to convert
from
analog to digital.
That device itself is analog; you just push current through, and
the multiplication is done. There was some earlier work like that, but recently,
when I attended MICRO, what I saw is that it's a trend: we put
computation everywhere, and
computation in the cache is coming back. We had those papers a couple of years ago,
no, several years ago; now it's coming back, and they test those ideas with
machine learning applications, okay? So there is some locality, and the CPU can
offload those things. So, yeah, that's beyond the basics, okay? So the last one.
Do you know what a virtual address is?
It is also one of the big topics for machine learning system design.
So when we have huge KV caches, then, you know, you will have page faults. So
what is virtual memory? Let's go over the basics. I know I have a separate
slide set, so we will go over it again, but let's just
briefly discuss it.
What is a virtual memory?
Logical addresses. Logical addresses?
Yeah, like, rather than the physical address? Physical address. Okay, so when we
have a… very good… when we have a memory hierarchy, what is the top one? Do you
remember? We always have a pyramid, right? What is the top storage? Smallest…
Fastest.
Register. And then, second?
Level 1, level 2, the SRAM caches, right? And then memory, DRAM. What is at the
bottom?
Disk, right? Okay, tell me, with that scenario, tell me what is virtual memory.
In that hierarchy.
Where is it, huh?
Last one: we see the disk as memory, virtual memory. So, physical memory size is fixed,
right? And limited, but we use disk space as a part of memory. How do we do it?
Swapping? Swapping. Okay.
The unit of swapping is called a
page, okay. So, what is a page fault, then?
A page fault: the page the user program asks for
is not in physical memory, okay?
So, in the undergraduate class, I give this example. I go to Evans Library; the
bookshelf is the main memory, and the table is the cache. And I don't find the book on
the bookshelf in Evans Library.
I went to Boston, actually. Okay, right after MICRO. Harvard. Harvard has a huge…
the largest library in this country, right? So you need to fly to Boston, go to
Harvard Library, get the book. That is a page fault.
Can you see the time difference?
Okay, so then, when you fly there to get a book, will you get just one book?
Like, in Evans Library, when you don't have a book on the table, you go to the bookshelf, and from
the bookshelf you will bring a block of books, right? That's what we have: a cache
block.
Okay?
So, when you go to Boston,
how many books will you bring? Just the same as a cache block?
You need to amortize the time it takes to go, right? So, how many books?
As many as can fit into your
baggage, right? That is a page. Can you see that?
Alright, so you can see main memory as a cache of a disk.
Okay, then we need to revisit all the four questions we did before.
Do you agree?
What are the four questions we studied about cache design?
First of all, you need to get this thing, isn't it?
This thing.
Tell me how it works. Okay.
Before you listen to
the video set for virtual memory, why don't you think about the four questions, and
study them, and then listen? Then you won't forget.
Because with virtual memory, main memory is actually acting as a cache of the disk.
Wherever you have a cache, you need to ask the four questions.
Placement…
identification, what is it? Replacement, and write policy, okay? You need to ask those
four questions. Like, how do you partition the address? Is it fully associative or
directly mapped, right?
And then if there is higher associativity, what is the replacement policy?
You need to answer those, okay?
I won't go into it, since it will be covered later, but you should know. Okay, so somehow,
in the cache, you have a block offset.
Okay, when we have a page, as we said already, we already know a page is bigger,
right? So it's a page offset. Can you see that?
Okay.
What was the last item we talked about? How do we reduce hit time, right? So, do
you agree, when you have a virtual address, you need to go through translation? We
call it translation: get the physical address.
And then use this physical address to get the cache index to find it in the
cache. Can you see that?
It's multiple operations; can we make them parallel? That was the last topic we
discussed. We have a way to do it, okay?
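One common version of that parallel trick is to index the L1 with bits taken from the page offset, so cache indexing can start while the TLB translates the rest of the address. A rough sketch of the usual sizing condition; the cache and page sizes below are assumptions for illustration:

```python
# Hedged sketch: when can L1 indexing overlap with address translation?
# The cache index plus block offset bits must fit inside the page offset,
# i.e. (cache size / associativity) <= page size, so the index bits are
# untranslated. Sizes below are made-up examples.

def can_index_in_parallel(cache_bytes, associativity, page_bytes):
    # One "way" of the cache must be no bigger than a page for its
    # index bits to come entirely from the page offset.
    return cache_bytes // associativity <= page_bytes

print(can_index_in_parallel(32 * 1024, 8, 4096))   # True: 4 KB per way
print(can_index_in_parallel(64 * 1024, 4, 4096))   # False: 16 KB per way
```

This is one reason L1 caches often grow by adding ways rather than sets: adding ways keeps each way within a page, preserving the parallel lookup.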
But this is an important topic.
All right, let's go to the quiz associated with this. And this quiz I really like,
because it gives you an
idea of how the cache works.
Does cache map into the physical address of the program, or the virtual address?
So…
The partitioning you learned, identifying what is the block offset,
cache index, and tag, is done on the physical address. So, first of all, you need to
translate
from the virtual address, okay? The virtual address is used by the user, okay? Then the virtual
address should be translated to the physical address. How do we do it? You learned this in
operating systems, right?
Page table.
Do you know the page table? Each process has its own
page table. So, the operating system won't allow the user to use physical addresses, so that
they could touch everything. We protect our memory system
by giving virtual addresses only to the user, the user program. When you print a
trace of memory addresses in your program, all those addresses are not actual memory
addresses; they are virtual memory addresses. And the virtual memory address will be
translated to physical, so actually the cache is unseen, transparent to the user.
That's why a lot of programmers hate the cache.
Yeah. And there were some studies about cache behavior for security attacks at
this MICRO, so you may want to go through some of the paper titles, okay?
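The translation step itself can be sketched with a toy per-process page table; the page size and the VPN-to-PPN entries below are made up for illustration, not a real OS layout:

```python
# Toy page table lookup: split a virtual address into virtual page number
# (VPN) and page offset, look up the physical page number (PPN), and
# reassemble. All entries here are hypothetical.
PAGE_SIZE = 4096  # 12-bit page offset (assumed)

page_table = {0x00400: 0x12345, 0x00401: 0x0BEEF}  # VPN -> PPN (made up)

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        # In a real system this traps to the OS, which brings the
        # page in from disk: a page fault.
        raise RuntimeError("page fault")
    return page_table[vpn] * PAGE_SIZE + offset

paddr = translate(0x00400 * PAGE_SIZE + 0x123)
print(hex(paddr))  # 0x12345123
```

Note the page offset passes through untouched; only the page number is translated, which is exactly what makes the parallel-indexing trick above possible.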
Okay, any other questions?
So, let's go… let me give you some time for quiz 20. I like this question a lot.
More than a couple of times, I have used it on the final exam.
Write-through, write buffer… So you only have a level-1 cache; main memory is
interleaved with
four independent 8-byte banks. Okay.
Let's understand what interleaved memory is.
How is it different from multi-port memory?
So it sounds like they have four independent 8-byte memory banks. So, you have
memory,
four of them.
Interleaved means what?
How is this different from multi-port memory?
Same thing. Multibank.
Okay.
Nobody have…
You don't have any prior… okay, anyway. Multi-port: when you have, like, a DRAM,
when you purchase DRAM, have you ever expanded your memory system?
Did you try?
Okay, then you will have a separate physical port.
That means there are two access points if you have two memory ports. These are two
independent ports, right? And then, a lot of times, for the first port, let's say you have 2
giga, 1 giga each; then from 0 to the first 1 giga will be one DRAM DDR module, and
then the other.
Interleaved means you have only one port. The access point is one. From the CPU, you have
only one serving point, but interleaved, each bank is 8 bytes.
Interleaved means: the first
8 bytes here, then the second, third, fourth bank, like that. When you read a 32-byte block, it is
interleaved this way, so it will be read at the same time, in parallel.
So, with the 50-nanosecond latency, you can read 32 bytes, because all four banks work in
parallel.
Is it clear?
However, this is, you know, an odd design. Your memory bus is actually 8 bytes, so
although you have the 32 bytes ready, it will take 4 consecutive bus cycles to deliver them,
okay?
Remember that? And the bus clock frequency is one fourth of the CPU's, so if this is 1, the bus
will be 4 times slower, okay? 4 times slower.
The instruction cache miss rate is 0.1%, and then 5% for loads and stores. 20% of
instructions are loads, 10% are stores. So let's determine the miss penalty.
The miss penalty: once you figure out it is a miss,
how long does it take to get the data from memory? So, I would think, okay, the
CPU is here, right? And then you need to use the bus to carry the request, and
then it will go to the
interleaved banks and read, right?
Assume you are the first one; there is no queue. It doesn't say there is any queue, so you
just calculate how long it takes to get one block of data, okay?
Alright.
The miss penalty should be easy.
Okay.
So you can assume… 1 gigahertz: what is your clock cycle time?
1 nanosecond. So then, what is the bus speed?
4 nanoseconds, isn't it? Right. Okay. And your block size?
32 bytes, right?
32 bytes.
Yeah. You have 32 bytes.
Okay.
So first, how long does it take to deliver your read or write request?
50 nanoseconds?
One fourth.
So, okay, what is your clock cycle, the CPU clock cycle?
One nanosecond. What is your bus
clock cycle time?
4 nanoseconds. So it will take four nanoseconds to walk the request to the memory, right? So
let's assume you are the first one. You don't need to wait, there is no queue, you
are the first one. So what is the memory access time?
It's given: 50 nanoseconds, right? So, to read from each bank,
to read 8 bytes, takes 50 nanoseconds, but these are interleaved; they can work in parallel.
So to read the 32 bytes, because these are interleaved, it takes only 50 nanoseconds.
4 plus 50, and then you need to deliver those 32 bytes to the CPU, right? That is the
miss penalty.
So then, how long does it take to deliver? Your bus width is 8 bytes, but your
data is 32, so how many consecutive cycles should you use the bus?
Four times, right? And each time takes 4 nanoseconds, so 16, okay?
So, you multiply, and then add them up. How much is it?
70, okay. Let's go to…
Okay, so this is actually: 4 to deliver the read or write request, then the memory access
time, 50, and the bus transactions, 16. Okay?
It's an odd design. You access memory in 32-byte blocks, but your bus is only
8, right? So, that's the thing. Okay, can you calculate the average memory access time?
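The miss-penalty arithmetic from the discussion can be written out as follows; it reproduces the 4 + 50 + 16 = 70 ns figure:

```python
# Reconstructing the miss penalty from the quiz numbers:
# CPU clock 1 ns; bus clock 4x slower (4 ns); bus width 8 bytes;
# block size 32 bytes; 4 interleaved banks, 50 ns each, read in parallel.

bus_cycle = 4                           # ns per bus clock
send_request = 1 * bus_cycle            # 4 ns to carry the address to memory
memory_access = 50                      # ns: all 4 banks read 8 B each, in parallel
transfers = 32 // 8                     # a 32-byte block over an 8-byte bus
return_data = transfers * bus_cycle     # 16 ns of back-to-back bus cycles

miss_penalty = send_request + memory_access + return_data
print(miss_penalty)  # 70
```

Without interleaving, the four 50 ns bank reads would serialize and the penalty would balloon; the banks overlap their access time, so only the bus transfers serialize.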
Now, the average memory access time is what? Your hit time,
complete, plus
miss rate times miss penalty. So we calculated the miss penalty. But remember, for this cache,
you have a separate data cache and a separate
instruction cache. Okay.
This is saying that all the reads happen in parallel, right? This question was
saying all of them happen in… I will give you 5 minutes. You can discuss with
your friends. Be careful, okay, be careful. This is the point:
When we calculate average memory access time.
when we calculate average memory access time,
think about hit rate and miss rate: the rate is, out of total memory requests, how many
are hits, how many are misses, okay? So, for example, here:
when 20 of your instructions are loads and 10 are stores,
how many total memory accesses do you have if your total number of instructions is
100?
Let's say you have 100 instructions, and the ratio of loads is 20 and stores is
10. How many memory accesses are you having?
Thirty? A hundred thirty?
Remember, when you execute each instruction, you also need to fetch the instruction from
memory, for every instruction. So when you have 100 instructions, how many
memory accesses happen for instructions only? 100, okay. And then, in addition to
100, you have 30 more
data accesses, data cache accesses, okay? All right.
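The access counts can be sanity-checked in a couple of lines:

```python
# Access counts per 100 instructions, using the quiz's load/store ratios.
instructions = 100
loads = int(0.20 * instructions)    # 20% of instructions are loads
stores = int(0.10 * instructions)   # 10% of instructions are stores

# Every instruction is itself fetched from memory, so fetches count too.
total_accesses = instructions + loads + stores
print(total_accesses)  # 130: 100 fetches + 20 loads + 10 stores
```

The 100/130, 20/130, and 10/130 fractions that show up later in the weighted average come directly from these counts.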
Let's go.
We have 16. Let's see how many of you can get it.
Like, how do we know that it's one bus clock?
Why are you assuming it's… Right. Why are we assuming one transfer per
request? I assume this is the initiating request.
The access is sent back. Why is the initiating request one…? It gives separate miss rates, one
for instructions:
out of every 100, 0.1% are misses, and then for data, load and store, you will
have loads 5%, stores 5%.
When we're issuing a request to the system, we're sending an address, and
how do we know that the address is one bus width in size? Oh, oh,
that's a good question.
So this is 32 bytes, and it didn't say the architecture, right? But it's
32. Assume every store… So, it seems like one word is 32 bits. Okay, so we're just
assuming that. Yeah, yeah, yeah, yeah, yeah.
So all the information, your address, can fit into one bus transfer.
Second.
I can quickly…
I think with a 32-bit address, you need to have 32 bits. There are a few bits at
the end you don't need, because you're always requesting a block, right? Yeah, yeah,
yeah.
And then whether you need it for read and write, right?
Let's assume…
Okay. Start with instructions.
Just simply write down the formula, and then you can do data separately.
It's within the same cycle?
Maybe just one.
Those are all there.
I'm trying to find some flights. Flight?
Thank you, buddy.
So, the average memory access time formula will be hit time plus miss rate times miss
penalty, and everything's there, right? For instructions, the miss rate is 0.1%.
So, your hit time will be 1 nanosecond,
because it's level 1; it operates at the CPU speed.
And then, if it is a miss:
0.001, and the miss penalty we calculated is 70.
Okay.
Isn't it?
Yep.
So, did anyone get the average memory access time for instructions only?
Instructions.
So let's separate:
memory access time for instructions,
and data, okay?
So, what is the instruction memory access time?
Because you have 1 GHz, we assume the level-1 cache operates at the CPU speed. This is a
very important assumption; we will talk about this later, okay? So, it says…
Yeah, so single cycle, see? Single-cycle hit time means the clock cycle time is the same
as the hit time, which is 1 nanosecond, isn't it?
You check the cache, but if it is a miss, you have additional…
What is that? The miss rate is 0.1%.
Yeah.
0.1?
0.1%.
0.001. Is it correct?
And then the miss penalty is 70.
Similarly, you can do data. Data: 1 nanosecond, and if it is a miss, for load… 5%… times 70…
It's too much.
Okay.
I have a different idea. Is this… so this is for load?
How about store?
Okay. Store: actually, it says it is write-through with a
write buffer, so the CPU does not wait for the store, right? So when you have a store, what is the
actual miss penalty?
Do you have any miss penalty?
Yes, I will put zero.
But, like, loads just constitute 5% of the…
So I need to multiply one more time, because 5% of those 20 are misses, so it should
be multiplied one more time, but I don't have space here.
Okay, 0.2, okay?
Notice also… there we go.
0.2 multiplied, okay?
But here, why write-through…
With write-through, you don't have to wait for the store; you just put that request on the bus, that's
it, right? So the miss penalty is actually zero. The miss penalty the CPU experiences will
be zero. Whereas for the load, you should wait until the data comes, right? So it's 70.
Alright?
So then, you need to do averaging over the instructions and the data, right? So
what's the ratio? Do you have the concept of weighted averaging?
So, how many total memory accesses are you having now?
130. Among 130, instructions will be 100, right? So, whatever this number is,
let me pick. The number is
1.07, and this… Per instruction? We're just calculating it per access. 1.07, isn't it?
1.07, okay? So we need to get the weighted sum, the weighted average. So, among the
130, 100 are actual instruction accesses. Instruction
access time is 1.07, whereas the data accesses are only 30 out of 130.
Okay, this is what I think, but you may have a different way of interpretation.
Any other thoughts… thought?
Some students didn't separate instructions and data. They put them together.
So… then how is it? Hmm.
So, in the end, it's the same.
And there's no other good way to combine those. So I
think that's just the answer, right? She's doing the weighted average.
Can someone quickly calculate it this way? Like, let's say we put them all together:
one nanosecond, and then,
among 130, for 100 of them, this will be
the average, 1.07, right?
And then, 20 out of 130, actually:
20 over 130 times 0.05 times 70, and then the rest, 10 out of 130, times whatever, times 0.
But that's… the results are a little bit different, aren't they?
Instead of just averaging…
Which one is correct?
The only difference is whether you separate load and store or put them together;
that's the thing, right?
It would be the same if you kept the 30 together with this, but if you separate, only
this term is affected; it will be gone.
I would accept both of them, because at first I saw it done this way: separate
instructions and data, and then put them together. But the tricky part comes when
you have write-through: the last term equals zero, so when you do the weighted
average, maybe you shouldn't take that part.
How can you count the 30 when the stores never pay a penalty? Yeah.
But anyway, I will take both of them.
Alright, any question?
Yes, hello.
It was. Yeah.
Oh, so you should have one load case and the other, the
store case. Then you will have this formula, identical to this. So this should be
correct, not this one.
Right. Very good. Okay, everybody!
The final answer should be this one.
Okay, this is not the way; like, it should be… Why? Because you cannot put it this
way. You're very clever, right? You're right. Because when we look at this
one: hit
or miss, and then you need to have a separate case for store, hit or miss. But
this hit and the miss-for-load and miss-for-store, you cannot just add these. Maybe you
would have to add this too.
Right? Then divide by 2, if you… okay, all right. So this is the
correct one. Let me change my answer to…
Okay. I put it that way.
All right, so in the meantime, did anyone try to do part C, okay?
Try to get the volume of memory traffic. So let me delete everything for you.
But this is a correct one.
This is the last answer.
Okay. What is that?
I don't understand where I was saying.
Why are we holding, like… We're already doing a 24 sentence.
I don't understand that at all.
So, is the average memory access time supposed to be the average time spent to
actually… So, remember, here it asks for the number of bytes transferred. If it is a
hit, you don't see any transfer, right?
If it is a hit, you don't see one. Only in the miss case do you see some data going on, okay?
So that's how you calculate it:
go through the three different cases, right?
One for instructions, one for loads, one for stores. It should be easy.
Just one question.
The other one, the other one I put there.
You are asking why it is wrong?
Okay, so for the average memory access time, hit, or if it is a miss,
you need to separate
the load and store cases. My mistake: I put them together. You're not supposed to
do that.
Right? So… you're having… let's say…
The average memory access time for load. What is it?
One?
And then?
Loads are 20%, and among those 20, you'll have
5% misses, and then it will be 70. Okay, how about store?
1, and whatever else is zero, right?
Okay, so you have three numbers: one number, and a second and a third number, okay?
This portion is instructions, so among 130, 100 are instructions.
Among 130, 20 are loads.
Among 130, these ten are stores. This one, so…
You should add this one. Oh, no, no, no, the hit time is always one, so it's true. These
are all…
Okay? That's why.
This is write-through.
Couple questions on that.
First is, the average access time. Is that the average time, given we already have
a load, that will be spent doing memory operations, or what does the average memory
access time mean? So, average memory access time is the average time it takes to access
memory.
So, what is the average? You need to have a total, right? So, among total memory
accesses.
When I say average memory access time for load: is that the average time, given we have
a load, or is that the average time per instruction spent doing a load? Oh, oh, oh,
okay, okay, okay, okay. So…
Like, if we have a store… So, what is… what is… Okay.
Average memory access time for a load instruction:
for a load, a data load, not an instruction fetch. Instruction fetches are captured by the first
term, okay? Only data. Then you check the cache first.
Only when you have a data miss do you pay additional time. If that is only on loads, why is
the 20% being multiplied? Since if we have a load, it's a 100% chance that it's a
load. That 5% is, like, the fraction of the loads that miss.
So I'm doing it.
Okay, so… So, this is the way I thought about it, okay?
I calculate the average memory access time for instructions. Let's say it's A.
And then I calculate the memory access time for data loads only, okay? Then, as I
found, those memory accesses happen in the ratio of 100 to 20
among the 130. So we do the weighted sum.
Boom.
So, I know you're troubled about why the 20% appears twice: because it's 20 over
130 times 0.2. Yeah, that's true.
I thought the way to do this would be to calculate: if you have a load, what's the
time? If you have a store, what's the time? If you don't have either, what's the
time? Then the average number of times each of those occurs would give you your
weighted average for an arbitrary instruction, the average time. The way we're
doing it now, I don't follow. Yeah.
Also, it shouldn't be 20 out of 130, right? It should be 20 out of 100? Yeah…
I think there should be 20 loads per 100 instructions, plus…
But the thing is, okay, so, can you suggest how you would calculate it? I
mean, I can say what I think. The thing is, if we get rid of this term, the weighting…
Or re…
So what you're suggesting is 1 plus: if it is an instruction, it's just
0.001 multiplied by 70. This is additional.
So, we check the cache, and then if it is an instruction, this is the additional time,
and then if it is a
load,
this is the additional time, and then this is the store's additional time. So this is
what you suggest, right?
I don't know why there are four terms. Yeah, because those ratios are not reflected;
that's why I'm troubled.
Which ratio? Because, when we have this value we're calculating… So, for example,
for a load instruction,
there should be one hit time for the instruction fetch, and one hit time for the load itself.
Right? You need to access memory multiple times. Yeah, yeah. But you're only adding the access
time for one, because it's 1 plus…
So, I think that's the problem.
Yeah, I agree.
So, it is your one.
This is offline, right? No, you are having… if it is a hit,
and then if it is a miss: this happens at a ratio of 100 out of 130, and this piece,
for me, this happens at a ratio of 20 out of 130.
Okay, is it time already? Okay, sorry. We will come back. I like this question.
But we will come back to this part C question, okay? And then we were finishing up
the discussion. I'm totally confused.
But you need to have a weighted averaging.
Thank you!
That makes sense.
She was like, okay, stop, I'm taking it.
The test, right? She was scanning you.
If you're assuming, like, a 50 on it?
What I'm saying is…
So, we could say that 100% of the instructions pay, with some probability, a miss
penalty of this much. Or, it will have a 20% additional penalty, or 5% of the
loads miss. And then for the rest of it, you have a zero miss
penalty, because
you have your instruction accesses at the beginning, and then you have your 20%
loads.
That's the…
Why'd she cross it out?
Who knows?
Don't do the 1 plus, yeah, yeah, yeah.
Yeah, now I… This is in general, because the details are not…
There shouldn't be a 0.2 there, and there are terms where we don't multiply by 0.2.
Seems like a really good metric.
I heard you saying that the other day.
Oh, okay, gotcha. And yeah, I guess…
Oct 29:
Good afternoon. Let's start. One of you asked about the reading assignment. Let me
check.
So, we started a new chapter.
Right?
Then, so, did you finish reading it all, or…
Oh, okay, okay, okay. Since this has been up since October 26th… Can I open this?
Alright. So now you have a reading assignment, as you wished. You're gonna read it
anyway, right?
The reading assignment: someone asked on Piazza that this
week you don't have a reading assignment, and then I just added it, okay? All
right? It's out there. All right, so,
let me clarify the quiz 20 we did last time.
Actually, yeah, a couple of you stayed late and then discussed this.
Actually, this is the final answer, I think, for the average memory access time for that
question. Let me open up the question.
Quiz 20, I think. Alright. So this is it, right? So, the average memory access time
will be like this.
So, when you check the cache, it will be one, okay? And then there are two cases,
roughly, where a miss happens. One is for instructions.
So this is the instruction term, and then the other is for data.
And then here is the additional time, if it happens.
Additional time, but how much and how often? Among 100 instructions, you will
have a
0.1% miss rate, right? So it'll be
0.001, and then the penalty is 70. This is for
instruction misses. This is the percentage of instructions that miss, and then the
miss penalty for that.
And then for the other, there are two cases, store and load, right? So load was,
among the 30, two thirds, and then among those, only 5% miss, right?
Or miss, and this is the one. And then the other is one third, but
whatever the miss penalty is, it's zero. So this term will be gone.
So this is the way I think we will get the final memory access time.
Every access first checks the instruction cache, right?
It is different for the data cache. And for the instruction cache, when we calculate,
we have to include the one.
Okay, so what you are saying here: you want to change
the formula this way. For instructions, if it is a hit, it will be 1. If it is a miss,
it will be this one. This is the instruction part. And then the other, the 1 plus…
One's two thirds?
70, right?
Right? There are two kinds, but this is…
So, how is this different?
Think about it: this 1 and 1 is common. It's the same thing.
It's the same thing, okay? So these two are eventually the same.
All right, so then the last one: the average number of bytes transferred. Okay, so
what happens on an instruction miss?
How many bytes do you need to exchange? You need to send the request, and you get
a reply, right? The reply is a block, so 32. So, for an instruction miss,
is it 32 plus 1, okay? How about a load miss? A load miss:
1 plus 32, right?
How about a store miss?
So, a miss.
You just need to send the data, right? So, roughly, let's say, 4 bytes, okay? And
then for each one, you need to take a weighted average.
So this is, the… so…
So, among 100 instructions, the fraction that are instruction misses
will be 0.001. And then for this one, among 100, you have 20 loads, right? And the miss rate
will be 0.05.
And then, how about this? Every store has to go out, so every store you need to
send. It just applies. So you multiply this, and then add this, and then add
this.
Okay. Isn't it?
You need to…
Rough… this is a rough estimation. I think you need to send one word in
addition to the address. So, precisely, it'll be five, instead of… well, isn't it?
You need to have the data, and then the address, when you write.
Okay, so what's the number?
Can I check?
What's the question? The… when you're adding the plus one for the load and the
store? Load. For a load, you need to send a request,
and the address, right? Yeah, so one byte.
Isn't the address more than one byte?
We assume it can fit in. Okay, data is 4 bytes.
And then the address is 1 byte. That's how we assumed it for instructions, too. Whenever we
miss, we need to send the address, 1 byte.
Because it's a 32-bit address…
A 32-bit address, right? That's 4 bytes for you. 32.
Okay.
Let's see.
So you need to send the starting point of the address, which is 32…
32 bits, so it should be 4 bytes. So if you use 4, everyone is 4. This should be
4 plus 4: 8. So the first… what are the two fours? One of the fours
is the address; what's the other 4? This:
for a store, you need to send the data. It's write-through.
Write-through means you need to send the data and the address together.
Okay?
So you can post the clean version on Piazza and I can endorse it, okay? So, if I correct this: 4 bytes, because the address is 32 bits, and the data is another 32 bits, and here, for a store, you send the address plus the data, so 8. That happens every 10 instructions; this one happens every 20.
So among 100 instructions, one instruction is a load miss, and then among 100 instructions, you multiply by this 0.1 for the instruction misses. These are fairly realistic numbers: instruction misses are very rare, because instruction access is mostly sequential, right? So it exhibits more spatial locality.
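The weighted-average traffic calculation above can be sketched in code. The per-class rates below are assumptions for illustration (the exact quiz numbers are hard to recover from the recording); the structure is the point: a 4-byte word plus a 4-byte address per transfer, weighted by how often each event occurs.

```python
# Weighted-average CPU<->memory traffic for a write-through cache.
# The rates below are illustrative assumptions, not the exact quiz numbers.

WORD = 4  # bytes of data per transfer
ADDR = 4  # bytes for a 32-bit address

def traffic_per_instruction(instr_miss_rate, load_frac, load_miss_rate, store_frac):
    """Average bytes moved between CPU and memory per instruction."""
    instr = instr_miss_rate * (ADDR + WORD)            # instruction miss: address out, word in
    load = load_frac * load_miss_rate * (ADDR + WORD)  # load miss: address out, word in
    store = store_frac * (ADDR + WORD)                 # write-through: EVERY store sends address + data
    return instr + load + store

# Assumed mix: 0.1% instruction misses, 20% loads at 5% miss rate, 10% stores.
print(traffic_per_instruction(0.001, 0.20, 0.05, 0.10))  # about 0.888 bytes/instruction
```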
Alright, so let's move on.
No, you don't wait. We talked about this: when you have a write-through cache, we have a write buffer between the CPU and memory. So you just dump your request there, and that's it. The CPU never waits until the store is done.
And this question asks for the traffic amount: how much data is moving between the CPU and memory.
Like, we divide this number by 13. Okay. This is a good question. Let's see if I can find space for it on the final exam. Some questions, although I like them, I couldn't put in.
So, yeah, that's interesting.
Okay, where are we? So we're done with the basic techniques, and now we start to talk about advanced techniques. Before that, let's briefly go over memory technology and memory optimizations. I think we talk about interesting stuff here, so let me… I will go quickly.
But just to make sure you're on top of what's interesting: if you look at MICRO, or at the proposal submissions… I hope you went to the MICRO website.
First, the paper was about 3D stacking, and it's coming back. DRAM 3D stacking: people talked about it, and then, you know, temperature problems, whatever, and then HBM was a big hit, right? But now it's different from HBM. I need to read it more carefully, but it seems like, with machine learning applications, memory requirements are huge, so we want to have on-die memory. That's what people work on. A lot of papers on that.
So, yeah, we want to have memory on the die, and the memory will be distributed per node, whatever. Then the communication matters; I'm one of the people who works on communication, right? The key idea of the new 3D stacking proposal, as I remember it, is that they allow communication within the DRAM stack itself. Before, it was at the vertical level: you shoot the data through a via, like a metal layer, and then you go through the vertical wire to send it to another, you know, CPU. But here, you can communicate even without shooting down through the stack, something like that. It was a very circuit-level innovation, so I didn't catch the details, but that was the main idea.
Audio shared by Kim, Eun J
With this set of slides, we will discuss memory technology.
Before talking about optimization techniques, let us go over the memory technology available. Remember, there are two performance metrics. First, latency: latency is the major concern of cache design; bandwidth, however, is the major concern for multiprocessors and I/O.
The access time can be defined…
Even further, bandwidth is the main parameter, the performance metric we look at for machine learning applications.
Okay, in machine learning applications, we have huge data, the model size is so big, and the memory is just as big, so we mostly talk about throughput-oriented design, okay?
And then, for computer architecture people, we feel that a high-throughput design is easier to achieve than a low-latency design. Latency reduction is really difficult; you really need new hardware.
We learned this all along, right? You're done with the midterm; what did you learn? All we talked about was pipelining, isn't it? Right? That is how we improve bandwidth. Even for machine learning applications, even if you read MLSys or the AI conferences, they talk about pipelining all the time. Because of throughput.
the time between when a read is requested and when the desired word arrives at the CPU. The cycle time is the minimum time between unrelated requests to the memory.
Note that for caches, because they should be low latency, we use SRAM, and for the main memory, we use DRAM to provide high capacity and bandwidth.
DRAM uses one transistor to store a bit of information: charged is 1, discharged is 0. However, SRAM uses 6 transistors to implement two cross-coupled NOT gates. NOT of NOT is the true value, right? That's the way they store 1 bit of information, which requires 6 transistors.
The benefit of having 6 transistors, the cross-coupled inverters, is that the cell is stable, so it requires lower power to retain the bit, and a read doesn't need a refresh.
However, DRAM uses one transistor. As time goes by, the capacitor's charge drains away, so we need to refresh periodically, which is a big overhead. Also, on every reference: when you read, you need to refresh one more time, because reading discharges some of the cells.
The way DRAM is accessed: the address lines are multiplexed. The upper half of the address is the row access strobe (RAS); the lower half of the address is the column access strobe (CAS).
So usually, the internal organization of a DRAM looks like this. You have a read/write request coming in, and that address is demultiplexed into a row and a column address; there are multiple banks, and then the right one is referenced. Usually, in front of the memory banks, they have a row buffer.
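The row/column multiplexing can be sketched as follows; the bit widths are assumptions for illustration, since real parts vary.

```python
# Sketch of DRAM address multiplexing: the controller presents the upper
# address bits (row, latched by RAS), then the lower bits (column, latched
# by CAS). The bit widths here are illustrative assumptions.

ROW_BITS = 14
COL_BITS = 10

def split_dram_address(addr):
    """Return (row, column) for a cell address within one bank."""
    col = addr & ((1 << COL_BITS) - 1)                # lower half -> column
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)  # upper half -> row
    return row, col

print(split_dram_address(0x12345))  # (72, 837)
```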
We know from Moore's law that we can double the number of transistors roughly every one and a half to two years. It helps improve memory capacity: more transistors. However, it hurts memory speed, because you have a huge capacity, and accessing one word out of that huge capacity takes a longer time.
So memory capacity and speed have not kept pace with the speed of processors. It was a big problem. There are several innovative ideas explored at the architecture level to fill the gap. The first is that we provide multiple accesses to the same row by having a row buffer: once you read, you read the entire row instead of just the word you are looking for.
So, an attack, if I connect this to a current research topic, is this.
In the earlier figure, you see we demux the row address and column address, right? And because of locality, once you read one whole row of data (every bank has a row buffer),
the next time you go to another column of the same row, it will be held in the buffer, so you can just supply that data without accessing the memory array again. And by exploiting the existence of the row buffer, there is the row hammer attack.
I don't know the details, but it relies on the existence of the row buffer: the attacker observes the hit time there, and that is what gets exploited. So, the row buffer is there to improve the performance of memory, and then again, the attacker uses that kind of thing. Then what we usually do, the idea is hiding those differences: even when it is a hit, you hold it, okay? You hold the row buffer hit, but you pretend it's not a hit; you just stall for as long as a full memory access would take. Then the attacker cannot see the difference, right?
But that's a performance problem, right? So that's the research question: without that kind of holding time, or minimizing it, can we still provide the security?
That kind of work is going on.
And later accesses will hit in the row buffer.
The next attempt: we add a separate clock to the DRAM interface, which later on was coupled with double data rate. Usually, when you have a clock period, only the rising edge is used for reads and writes. Here, you do reads and writes on both the rising and falling edges, so you…
That's DDR. From that, we call this memory DDR: double data rate.
Why is that better, or worse, or in any way different from just doubling the clock rate?
Oh.
Okay, so this is the way it is. We do our best with the speed of memory technology. So, let's say the memory clock is this long wave, okay?
We do our best, but this is the best we have, okay? So then, traditionally, when we have a clock wave, we do work only on the rising edge, right? But your CPU is much quicker; its clock wave is much quicker. So you want to reduce the mismatch between the fast clock and the slow clock. So what they do is change the interface: we allow not only the rising edge but the falling edge, too.
So…
Is the clock of the CPU in this double data rate the same as the clock for the memory, or are they just… No, no. When they say double data rate, it's a doubling on the memory side, in the memory technology. So the falling edge… that's the falling edge of the memory clock? Yes.
Yes. And why is using the falling edge of the memory clock different than doubling the memory clock and only using the rising edge? Oh, okay, that's a very good question. So, when we define a clock, it depends on how you define one atomic operation.
Okay, I forget the details, but one atomic operation defines the clock cycle time of the memory. This, however, is more about the interface with the CPU. So you have a big buffer in between, buffering to accommodate the two different clock domains. Let's say a write takes the whole clock cycle time, but fetching for a read doesn't take the whole clock time; then you can do double-rate transfers there.
So, my question is: how is this different from operations which need a full clock cycle taking two cycles in a doubled clock, and operations which… Yeah, I immediately thought that way too. So what's the answer?
Simultaneous read and write: you can do both at a time. That's the benefit of having DDR. But what he's saying is, if simultaneous read and write can happen, why not shorten the clock cycle time itself? Like, you could have one read, one write, like that, right? Maybe there's an imbalance between the two latencies.
I won't go into detail, but it's like this: in order to double the actual clock speed, you need to have the same latency for all operations. What DDR does instead is define the clock cycle time more loosely, but inside, you have some room to do scheduling: oh, you finished early, so you do another thing within the same clock, something like that. The latencies are not evened out. If they were all evened out, yes, of course you could double the actual clock speed, right? But DDR is a very well-known technique everybody uses nowadays. It's been there for a while.
It can double the data rate.
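As a rough sketch of why transferring on both edges matters: peak bandwidth is clock rate times bus width times transfers per cycle. The clock rate and bus width below are assumed numbers for illustration, not a specific part.

```python
# Peak-bandwidth sketch: a DDR interface transfers data on both clock edges,
# so it moves twice the data of a single-edge (SDR) interface at the same
# clock. The clock rate and bus width below are illustrative assumptions.

def peak_bandwidth(clock_hz, bus_bytes, transfers_per_cycle):
    """Peak transfer rate in bytes per second."""
    return clock_hz * bus_bytes * transfers_per_cycle

clk = 200_000_000  # 200 MHz memory clock (assumed)
bus = 8            # 64-bit data bus (assumed)

sdr = peak_bandwidth(clk, bus, 1)  # rising edge only
ddr = peak_bandwidth(clk, bus, 2)  # rising + falling edges
print(sdr, ddr)  # DDR doubles the rate without a faster clock
```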
Another technique, which we will talk about later: when we have a separate clock, we can have a burst mode with critical word first. We read a big chunk of data, but the word the CPU is looking for is placed at the beginning of that package, so the critical word arrives at the CPU much faster.
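A minimal sketch of the critical-word-first transfer order (the 4-word block is an assumption):

```python
# Critical-word-first: a missed block is sent starting at the word the CPU
# asked for, wrapping around, so the CPU can restart without waiting for
# the whole block to arrive.

def critical_word_first(block, requested_index):
    """Return the words of a block in transfer order, requested word first."""
    n = len(block)
    return [block[(requested_index + i) % n] for i in range(n)]

block = ["w0", "w1", "w2", "w3"]      # one 4-word cache block
print(critical_word_first(block, 2))  # ['w2', 'w3', 'w0', 'w1']
```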
Also, transfer time can be a bottleneck, so we can widen the interfaces.
You learned this from our quiz example, right? If we have a multi-port, multi-bank memory, you can have a lot of data ready, but if your bus width is narrow, you need to serialize. So widening the bus is one thing you can do.
Last thing: we exploit the huge capacity by having multiple banks. Each bank operates separately, so we can provide parallelism.
Okay, so I think the next one is more advanced technology.
With this set of slides, we're going to discuss the technologies used in a memory hierarchy, specifically in building caches and main memory. These technologies are SRAM, DRAM, and flash.
Let's talk briefly about SRAM first, before we begin with the DRAM technology.
So for this part, would you listen to it yourselves, without me? I think this part you can listen to and read, right? These are all, you know, trends.
Oh, I want to talk about this.
So, the earlier part, I think you can read, right? Okay, well, let me go through this part, and then… the earlier part, you should listen to on your own, okay?
Active memory… it will be around over 500 megawatts.
Let me introduce to you the most innovative advanced memory technology, which is stacked, embedded DRAMs. DRAMs are stacked in the same package as the processor, which means you have a logic layer. Note that logic technology and DRAM technology are two different processes, but they were successfully put together. There are two different big companies' designs, as you can see. The left side is vertical stacking, explored by the Micron company: HMC. The right side is HBM, high bandwidth memory; they use an interposer.
They also stack up DRAM, but the logic is set aside; it's not vertically stacked.
Here, because on the memory side you have a logic layer, you can do simple computation. So it can be known as near-data processing, because…
So, do you see this color, more like purple? That is a logic gate, an AND or NOT gate. And the green one is DRAM, okay? It's a stack. So DRAM 3D stacking is, you know, successful. And then, the problem of this…
This is called HMC, by Micron. If you look at Micron stock, it's really high now, even. Because, like, that's the only memory company our country had. But then, you know, there's HBM from Samsung and the other place, right? Hynix.
So, the problem: at the beginning, HMC successfully released their prototype, but then…
What would be the main problem of this 3D stacking? You know from the EE people, the circuit people, that these two technologies are different. You cannot just put them together. You have a metal layer, you draw on silicon: that is, logic gates. And then, in order to put the 3D DRAM on top, it's totally heterogeneous, a different technology, right? So it's not easy to do, but they attempted it, and they were successful. But then why do we now not hear about HMC, only HBM?
Any guesses?
It was too expensive. Companies didn't buy it. But then, you know, HBM: because it's technically so challenging to put the logic layer under the DRAM stacking, what Samsung did is use an interposer. So it's more like…
Passive communication is implemented in the medium itself, so they put the logic aside. That is not that difficult, right? You can put the logic here and the 3D stack there. So it's not off-chip communication anymore; they can communicate through the interposer.
So we call it 2.5D stacking. It's not 3D stacking, but it was a huge hit, right? So every company, like NVIDIA and Apple, wants to have this HBM. Any accelerator company wants HBM, because nowadays language models are memory-intensive applications, so they need nearby memory.
And computation power near the memory works so well with machine learning applications, because in machine learning, what do you do? You're taking a machine learning class, right?
It's nothing but… what is it? When we draw the network architecture, a machine learning architecture like a CNN or DNN, it's nothing but this: you have many, many layers and weights, right? At each point, you compute a weighted sum.
All the computation, except the pooling layers (or at the beginning of a CNN, it's more like filtering), is the same thing: multiplication and addition, right? The computation is so simple, you can have it near the data.
This has become really hot, and people all jumped in, and this time at MICRO, not only HBM: even more 3D stacking technologies were introduced, so people are very excited.
This is a trend, a really hot one. At the beginning, when I shared this idea with the real, you know, circuit people, they were so skeptical: oh, no, it won't be possible because of thermal issues.
Okay, but even when we first talked about 3D stacking, the number of floors was only 2 or 3, because of thermal issues. Now we don't have to worry about that as much.
The logic is: yes, 3D stacking of logic is hard, but for memory, it is possible through vias.
So just be aware of this kind of huge trend going on.
Because nowadays most applications are data-centric, a lot of data movement is required before you compute.
Normally, the memory side only stores data; but if you have this logic layer to exploit, you can do simple computation on the memory side. This is one of my research areas, one I worked on and am still working on.
Nowadays, everyone uses a flash card, right? Actually, it's a type of…
The later part you can read, okay? Flash memory, you know, right? It's EEPROM: electrically erasable, programmable read-only memory. So, yeah, you can remember it this way: it's fast like memory for reads, but it is slow like I/O for writes.
Okay, so we use flash memory as a cache for the disk. You can put flash memory, an SSD, between memory and I/O, so we introduce one more layer into the memory hierarchy.
Yeah, again, you know, Samsung is doing so well with this flash memory; you'll find a lot of work done by those companies.
So, you can read about it, and then, let me… I think at the end, it shows us some…
Okay, so dependability is a really big issue when we have a huge memory, okay? We talked about reliability, right? So remember that kind of thing. So multibank is also good here. Why? Because instead of having one big memory pool, you chop it into multiple banks, so if an error happens, you can isolate it easily.
So that's also the trend nowadays when we build a big system.
Okay?
Alright.
So, later on, there will be some papers that talk about how to use HBM, but they're already kind of old papers. When I put them in, they were very new ones, so I will let you study them, okay?
So that's it for memory, and then we will start the advanced techniques to reduce memory access time. From now on, we will learn ten different techniques, and for each one, okay, try to articulate what it is, and especially identify whether it helps to reduce hit time.
What else can you reduce? Well, average memory access time equals hit time plus miss rate multiplied by miss penalty. So you can also reduce the miss rate, or the miss penalty: three things, okay? So you always need to be able to identify, okay, for this technique,
what will be reduced? And if none of the three, then why, how does it help? Okay? That's the main thing.
From now on, we're going to discuss advanced memory hierarchy techniques to reduce average memory access time. This set of slides will focus first on the techniques that reduce hit time.
Okay, so before going on, let's think first.
This is the way you study, right? Think first, and then compare. So how would you reduce hit time?
Okay, first, a quick true-or-false style question: big cache or small cache?
That's right. A small cache, right?
And then, fully associative or direct-mapped?
Direct-mapped. Okay, so you can reduce hit time.
What's the problem? The miss rate will be high, actually, right? So, what would you do?
Yeah, you want to have a simple, very small cache.
And then, yes, it will reduce hit time. So how much do you want to reduce the hit time? What's your goal?
Did you look at an Intel machine? What is their first-level cache size and associativity? Maybe that's your homework.
No, no, no, it's a hierarchy; I'm asking about level 1.
I'm talking about… oh, I already gave a hint. Why do I ask about the level-1 cache? What's the matter with level 1?
Do we want to reduce the hit time of the last-level cache? Yes, right? But is it really, you know, important?
Why is reducing the hit time of the level-1 cache important?
Yes, it's a hierarchy, but faster: the CPU accesses the L1 cache first.
So the CPU accesses the L1 cache first, right?
That's true.
There are, like, instruction and data caches separately. So you have an instruction cache and a data cache?
Alright, let's go back to RISC-V.
Tell me, what do you have?
Like, for the midterm, what did you prepare? You learned the Tomasulo algorithm and hardware speculation. And… what is it?
Hierarchical? No!
Dynamic… okay. Pipelined, isn't it? Pipelining! Okay, tell me about the RISC-V 5-stage pipeline architecture. What is it? Fetch, decode, execute, memory, write-back.
Memory is one of the stages. Can you see that?
Which means it should operate with… what?
The frequency of?
The CPU. Can you see that?
So, memory is one stage: even with Tomasulo, even with hardware speculation, you have memory as one pipeline stage, which means it should operate, in most cases,
within the CPU clock cycle time.
Huh? Yeah, fetch, very good. Fetch is also a memory access, right? So it should operate at the CPU clock.
It's very fast, right?
So, let's say you're the level-1 cache. If it is a hit,
the access should finish within the clock cycle time. Can you see that?
If it is a miss, yes, you will, you know, stall for many cycles and then come back. But the most common case is a hit, and that should be within the
clock cycle time of the CPU. That's the goal. Okay.
So then, with this in mind, you can use a spreadsheet-style model to decide: the CACTI model. C-A-C-T-I, okay? Write it down; some of the textbook exercises require the CACTI model.
And this time at MICRO, the big guys talked about, bragged about, the CACTI model… anyway, it's a very old tool that everybody can use. It's just like a spreadsheet, okay? You give it an SRAM size,
and it will give you the access time, read time, and write time.
Okay, then you have your CPU clock cycle time, right? So you can find the proper size of SRAM you can have. And you can even configure it, okay, as 2-way or 4-way, and it will give the access time of that cache, okay? Then you can find the configuration whose access time is smaller than or equal to your CPU clock cycle time. Can you see that?
So that's it. That's what we want to do.
Then, tell me, what are you going to do about the higher miss rate? That's what's coming next. And that's what, for many of the last 10 years, I asked students to implement for their homework.
Block size? Block size?
Bigger? So you want to have a bigger block in level 1?
Yeah, you can quickly run the simulation. I'm not sure whether it always helps or not.
Because when you have spatial locality, a bigger block is good, but the number of blocks you can have in a limited cache will be reduced.
Right, so if the window of your working set is different, then you will have misses, right? Yeah, like, maybe for the instruction cache it might make sense, if it's not jumping around too much. Yeah.
Okay, what else? What can we do?
So what else can we do?
Can you repeat, please?
Okay, that's a very good point. So, let's say we have a very simple cache,
to match the CPU clock. Okay, this one nobody forgets, right? Your number one goal for the level-1 cache is matching the CPU clock cycle time.
Right? So you're going to have a very simple cache, which means mostly direct-mapped or 2-way. Do you have room to improve replacement? Yes, you will do RRIP for your homework 4, right?
But mainly, the benefit of a better replacement policy comes from higher-associativity caches at the lower levels.
At the higher level, we don't provide high associativity; it's the small, simple one.
Okay.
So, mainly, if you have a small… okay, I will give you a hint. The name of the scheme is the victim cache. So, what are you going to do with a victim cache? First, let me define what a victim is. Whenever
you have a miss, you bring in a new block, and, let's say the cache is direct-mapped,
the block already there becomes the victim.
You kick it out, right?
But then,
sometime soon, it is needed again. Do you know what I mean? Because that is the problem of a direct-mapped cache.
Do you remember: if we have one spot for A, like in Evan's library example. I organize my table, my cache, direct-mapped, so A goes to A's slot, B goes to B's slot, like that. So if I need two books whose authors both start with A, say Aaron and Adam, they both need to be in A's spot, but I can hold only one book, right? So I'm having misses all the time: conflict misses, right?
So what can we do? We add a victim cache, small, small, but when I bring in Aaron, I put Adam in that place.
Okay, then if Adam is referenced again, I switch them back. Can you see that? We call it a victim cache. That's all.
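The swap just described can be sketched as a tiny simulation. The sizes and the FIFO victim buffer below are assumptions for illustration, not a specific hardware design.

```python
# Minimal sketch of a direct-mapped cache backed by a small victim cache.
# On a miss, the evicted block goes into the victim buffer; a later miss
# that hits in the victim buffer swaps the block back instead of going to
# memory, which absorbs conflict misses between blocks that share an index.

from collections import deque

class VictimCache:
    def __init__(self, num_sets, victim_entries=4):
        self.num_sets = num_sets
        self.main = {}                               # index -> tag (direct-mapped)
        self.victims = deque(maxlen=victim_entries)  # recently evicted (index, tag)

    def access(self, block_addr):
        index = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        if self.main.get(index) == tag:
            return "hit"
        if (index, tag) in self.victims:     # the conflict victim is still nearby
            self.victims.remove((index, tag))
            if index in self.main:           # swap: old resident becomes the victim
                self.victims.append((index, self.main[index]))
            self.main[index] = tag
            return "victim-hit"
        if index in self.main:               # evict the resident into the buffer
            self.victims.append((index, self.main[index]))
        self.main[index] = tag
        return "miss"

cache = VictimCache(num_sets=8)
# Blocks 0 and 8 conflict (same index); the victim buffer absorbs the ping-pong.
print([cache.access(b) for b in (0, 8, 0, 8)])
# ['miss', 'miss', 'victim-hit', 'victim-hit']
```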
Alright? So we can do a quiz. But first, let me summarize the idea of each technique here.
Let me briefly introduce…
Okay, any better ideas? How is this different from increasing associativity?
It's not specific to one set. What ends up there could be anything that was evicted; it's like a buffer. It doesn't have any indexing into the real cache. Whereas if you change the cache itself into an associative cache, do you agree the hit time will increase?
And then remember the big difference; on an earlier slide, I think I briefly mentioned it. With a direct-mapped cache, what happens is this. Okay, so direct-mapped, with the implicit cache index:
while you are doing the tag matching, there is a single potential candidate.
Do you see? With the offset, you can already prepare this word for the CPU. Meanwhile, if the tag match fails, you discard it. You can parallelize. However, if you have a two-way cache,
while you do the tag matching, do you know which way you should supply to the CPU? No. Do you know what I mean?
If you don't have a direct-mapped cache, your data is available only after tag matching. The nature of the access is serialized, whereas with a direct-mapped cache you can parallelize: while you do the tag matching, you can supply the data.
Can you see that? So it's much faster.
Okay?
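The parallel read-out works because the index already picks the single candidate line. A sketch of the address partition; the block size and set count are assumptions for illustration:

```python
# Address partition for a direct-mapped cache (widths are assumptions):
# 32-byte blocks -> 5 offset bits; 64 sets -> 6 index bits; the rest is tag.
# The index selects exactly ONE candidate line, so data read-out and the tag
# comparison can proceed in parallel. A set-associative cache must finish
# the tag compare before it knows which way's data to forward.

OFFSET_BITS = 5
INDEX_BITS = 6

def partition(addr):
    """Split a byte address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(partition(0x1234))  # (2, 17, 20)
```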
But with associativity, hit time will increase. The thing is, when we make the cache set-associative, we change everything, so you trade between the two. This is where the victim cache comes in: you keep one direct-mapped cache, and then a small extra cache. Whichever index the victim came from, you hold it here.
That's a victim cache. Okay, let me go through and introduce all ten techniques, and then you can go.
Let me introduce the advanced optimization techniques that we're going to learn in this chapter, before going through each one in detail later.
First, you need to reduce the hit time. This is especially critical for the level-1 cache, which needs to match up with the CPU clock cycle time. So usually, small and simple first-level caches are used. However, you may have high conflict misses, so to reduce that effect, often a victim cache is used; we will talk about it later.
Also, when you have a higher-associativity cache, you can use way prediction to reduce hit time.
There are techniques to increase bandwidth. First, the pipelined cache: you can pipeline row access, tag matching, and data transfer, like that. Also, you can have a multi-banked cache, and we can provide non-blocking caches, so you still serve cache accesses during a miss. These mainly increase bandwidth; they don't help to reduce the miss rate or the hit time.
There are also techniques to reduce the miss penalty.
When you miss the data, as you saw in the memory technology slides…
Okay, so tell me how this increases bandwidth. Look at this, the pipelined cache. A pipelined cache pipelines, like, indexing, then tag matching, then supplying the data. It's a three-stage cache, okay?
We already learned that having a pipelined architecture doesn't necessarily reduce latency, right? Hit time may even increase, right?
But we improve bandwidth. So… throughput, very good. Now, in your formula, average memory access time,
there is no room to accommodate the effect of throughput, right? Average memory access time is hit time
plus miss rate multiplied by miss penalty, isn't it?
Okay, so where is the effect?
Can we capture it in this formula?
Like, over multiple instructions.
Amortizing.
Over multiple instructions. So, let's say you have a series of instructions, and a lot of them are loads and stores: load, store, load, store. Okay, then tell me how this helps.
What is this formula? When I say latency, I mean
how long it takes to serve one customer,
in, let's say, TJ Maxx. If the things on the table ring up quickly, it's done like that; if not, then it takes a long time, or whatever, right?
But what if, instead of one checkout point,
I have 10 more checkouts? It will be
faster, right? But the latency itself is the same, isn't it?
So what has been reduced there?
Execution time?
Execution time, including…
So I go to TJ Maxx, I shop, right? And then compare one cashier checking out versus 10 cashiers.
You serve multiple requests, right?
Yeah, yeah, but think about it from the user's point of view. The average time per each person's checkout is, let's say, 1 minute. So when I get there, once she starts to serve me, it's 1 minute, isn't it?
But it doesn't include any of my… what?
I'm looking for a terminology here. Okay, so I go to TJ's, I shop,
and then, yes, I'm in the queue, right? In the waiting line. Can you see that?
The waiting time will be reduced if I have 10 places to check out, right? Because it's parallelized. So here, it's the same thing.
Among the instructions, there are many loads and stores, memory requests.
This formula only expresses the time it takes when I'm the first one, isn't it?
When I start to use the cache.
But if the cache is busy, what do I need to do? I wait outside.
Do you remember the Tomasulo algorithm? If that hardware structure is not ready, what do you need to do? You wait in the reservation station, right?
Same thing. If the cache is not available, you sit in the load queue, waiting. That time doesn't appear
in this formula. So the waiting time will be reduced, okay?
Waiting time in the queue, okay?
Other than that, everything maps to one of the three: hit time, miss penalty, miss rate, right? But this one, queueing time, is not one of the parameters in average memory access time.
Because it takes too long to activate one row, once you access one row of data, maybe a bigger block is better, so we can transfer big data. However, it takes a longer time to transfer, so they put the critical word, the byte the CPU is looking for, first, so that
memory can supply the critical word first to the CPU. Another thing is that you can merge entries in the write buffer to
reduce the number of write requests to the memory.
To reduce the miss rate, we can do 3 different compiler optimizations. We will see them with examples, okay?
When we have a miss,
if we can predict that this kind of miss is going to happen, we can prefetch, right? So…
Prefetching is used to reduce miss penalty or miss rate through parallelization.
Okay, so this time, I saw two separate sessions at MICRO on prefetching.
So it seems that in microarchitecture, among the classic topics in computer architecture, prefetching has become really important, I think, because, again, it's due to the applications: machine learning applications, big-data applications.
Prefetching really works there.
Prefetching means: if data access is regular, you know what data will be required next, right? So you can bring it in earlier. So this has become really hot.
What's the advantage of prefetching over having an out-of-order window large enough that loads issue far enough in the future that they load in advance, like a prefetch would do? So, loading in advance, that's a prefetch? Like, if you just get rid of prefetching entirely and take your out-of-order core, the loads will happen potentially far ahead of when the arithmetic units are ready. But what is the difference between that situation and having a prefetcher?
Hmm… so, in dynamic scheduling, you already saw that we try to issue loads as early as possible, okay? You reorder, right? We try to give the load enough time to get the data, but that "enough time" is based on the hit time of the cache. If it misses, it's 20 cycles even when the data is in the second-level cache; if it is in memory, it's 100 cycles. You cannot cover that. The window of dynamic scheduling is not that wide.
So what we are going to do is this. In the code, you see a load, we know it is not in the cache, and we know it takes 20 cycles to bring it in. There are two approaches — I don't want to teach this yet, but it's coming. The compiler, the software approach: you place a prefetch, like you said, statically changing the order — you could put the load 20 instructions ahead. Then you load it there, okay?
The hardware is more blunt: whenever we have a demand cache miss, instead of bringing one block, we bring two blocks, or four blocks, together — streaming. That's the very simple hardware way. But nowadays, if you look at papers, they try to calculate the stride. Look at matrix multiplication: when you walk the data not by row but by column, you know the row size, right? For example, with integer data — an integer is 4 bytes — and an 8-by-8 array, one row is 8 times 4, so 32 bytes. If you access down a column, each access is the previous address plus 32, then plus 32 again. Can you see that? That's a striding pattern, so you can calculate it, and then you know what the next data access will be. With that calculation, you can prefetch. We will talk about this later. So you can think of the window of dynamic scheduling as quite limited compared to this technique.
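A tiny sketch of the stride calculation just described (illustrative, not a real prefetcher design): with 4-byte integers and an 8-wide row, walking down a column advances the address by 32 bytes each time, and a prefetcher that sees a stable stride can predict the next address:

```python
def next_prefetch(addr_history):
    """If recent accesses are equally spaced (a stable stride),
    predict the next address; otherwise return None."""
    if len(addr_history) < 2:
        return None
    stride = addr_history[-1] - addr_history[-2]
    # only predict when the stride held over the whole history window
    if all(b - a == stride for a, b in zip(addr_history, addr_history[1:])):
        return addr_history[-1] + stride
    return None

# column-major walk over an 8x8 array of 4-byte ints: row size = 32 bytes
print(hex(next_prefetch([0x1000, 0x1020, 0x1040])))  # -> 0x1060
```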
First, let's look at how to reduce hit time.
Of course, you can reduce hit time by having a small and simple first-level cache — for example, a direct-mapped cache. Why do we want a small and simple first-level cache? Because the most critical attribute of the first-level cache is matching speed with the CPU.
As you know, the CPU clock speed is very fast, so we cannot accommodate a big cache, which takes a long time to find the right block, and we cannot afford a highly associative cache.
So let me illustrate what a victim cache is. Let me draw here.
If you have a small but simple cache — let's say direct-mapped — there are misses that happen due to conflicts, right? Unfortunately, your working set has conflicts: conflict misses. But most of the time, the number of conflicts is not that big. So what you can do is add a small, fully associative cache — in this picture, a four-entry fully associative cache — so whichever block is kicked out of the first-level cache is kept in the victim cache.
Then if it is referenced later, it can go back into the first-level cache. We call this a victim cache.
It's a very popular technique when you have a very simple and small cache, where you cannot avoid the conflict misses that happen due to that simplicity.
Let's look at the access time of the cache — in other words, the hit time — varying L1 size and associativity. The y-axis is relative access time; along the x-axis the cache size grows to the right, and within each group the associativity changes. It is normalized to a direct-mapped 16-kilobyte cache, okay?
Look at the first set of bars. For a 16-kilobyte cache, as we increase the associativity, it takes longer to get the right word. Why is that? Let me quickly draw what a set-associative cache looks like.
For example, let's say it's two-way. As you know, in a two-way cache, for the same index you have two blocks in a set, right? To figure out whether it's a hit or not, you need tag-matching hardware — we can duplicate the tag-matching hardware, right? Let's assume each way has its own tag-matching logic, so you do the tag matching in parallel, and at most one of them is a hit. Until you're done with tag matching, you don't know which one is the block you are looking for. So it goes through a MUX: you are reading data from both ways of the cache, but you need to choose either the left one or the right one based on the tag-matching result, right? So this is the critical path — you go through tag matching, then through the MUX delay, and only then is the block ready.
Whereas with a direct-mapped cache, even without finishing the tag matching, you already know which block it would be, right? You can parallelize: you can read the block out while you do the tag matching. Once the tag matching is done, you position to the right word by the offset and supply it. That's why, as associativity grows, it takes longer.
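One way to see the trade-off is to compute the tag/index/offset split for different associativities (a sketch with assumed sizes, not from the lecture): doubling the ways halves the number of sets, so the index loses a bit and the tag gains one — and two tag comparators plus the output MUX now sit on the hit path:

```python
def fields(addr, cache_bytes, block_bytes, ways):
    """Split an address into (tag, index, offset) for a set-associative
    cache. Sizes are assumed to be powers of two."""
    sets = cache_bytes // (block_bytes * ways)
    off_bits = (block_bytes - 1).bit_length()
    idx_bits = (sets - 1).bit_length()
    offset = addr & (block_bytes - 1)
    index = (addr >> off_bits) & (sets - 1)
    tag = addr >> (off_bits + idx_bits)
    return tag, index, offset

# 16 KiB cache, 64-byte blocks: direct-mapped -> 256 sets (8 index bits),
# two-way -> 128 sets (7 index bits, one more tag bit)
print(fields(0x12345, 16 * 1024, 64, 1))  # -> (4, 141, 5)
print(fields(0x12345, 16 * 1024, 64, 2))  # -> (9, 13, 5)
```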
And when you increase the cache size…
We really want to avoid the delay of tag matching.
Suppose an 8-way set-associative cache, right? Then if you want the victim cache to be better than the 8-way cache, the victim cache should be comparable in size, right? I mean, we have to do the same comparison.
Yeah, you can think of it that way, at least. But the thing is — remember we call it the working set. Within a certain observation window, what are the different data you are using? If the part of the working set that maps to the same cache index has three blocks, then a three-entry victim cache is big enough, right, even if they alternate. However, in the same window you also need to access the other cache indices. So it depends on your working set and how the conflict pattern happens.
And that is it. But you cannot compare a victim cache directly to an 8-way or 4-way set-associative cache, because the thing is, when you have one direct-mapped cache and you want to make it 4-way, you can visualize it this way: you cut it into fourths and put them side by side. Your cache index becomes smaller. For one cache index you have full flexibility; however, the number of different sets your working set can map to is reduced to one-fourth, too.
Whereas if you keep the direct-mapped cache as it is but add, say, an 8-entry victim cache — remember, while you index into the direct-mapped cache, you can search the victim cache at the same time.
That's the point: we want the same access time as the direct-mapped first level even with the victim cache, so the victim cache should be small enough.
Because it's fully associative, you need to search all the tags.
And the comparison starts at the same time for both of them, right? Yes, yes. When you have a load, for example, you calculate the address. By the time you search the level-1 cache, that address is also sent to the victim cache — because they are two different pieces of hardware, you can parallelize.
Right? What is this graph showing? Because it's clearly not just the hardware hit time. Is there some program being run, and this is the average time the program took to access memory?
What is the graph? I think this is a real hardware implementation — they just measured hit time.
Because at the 64-kilobyte cache there is a big dip, right? Yeah, why is that… For many years I questioned it, and we discussed it — I don't know. Sometimes there are some harmonic points there. What was my guess? I forgot.
Yeah, so the truth is, when you measure real hardware, or you run benchmarks on a very detailed hardware simulator, you won't get a beautiful graph.
Like, in my experience — our first ISCA paper, when I was a PhD student — we expected a graph going up, but in the middle, as we changed the ratio between different traffic types, there was a big dip.
You cannot imagine how much temptation I had. I couldn't explain that result, and I wanted to drop it or beautify the graph, okay? I had the temptation — because nobody knows, right? But I'm the one who knows. So for over one week I tried different ways, but I still had the big dip.
I submitted it as it was, and yes, the reviewer asked. I was honest in the answer: I don't know why — but I gave my best speculation.
It was the 70-30 combination: somehow, because you have separate buffers, right, it became optimal at that point. I don't know why — many layers of dynamics happen. So I was very proud: I was brutally honest about the results.
You cannot imagine — for a PhD student, an ISCA paper is a game changer, right? I really wanted to hide that, or change it. But I felt I had spent so many hours to produce those results. And sometimes in architecture these things happen — even the textbook couldn't explain it.
So from here, the general trend — look at the first group. As the cache size gets bigger, your hit time gets bigger, right? And if you increase the number of ways, yes, it takes longer, right? But then this one —
Why? I don't know. Are we assuming this graph is the critical-path length in the hardware? Yes. So for hit time, let's say this is two-way, right? For the same size, you cut it in half, your index field is shortened, and then you do the table search.
It's not a one-time hit-time measurement — they run the benchmarks, and with each configuration they measure the hit time. Why am I saying that? Because when you have a table, you don't know whether an access goes to the first entry or the last one. In a hardware SRAM table, the access time is actually different: the first entry hits first, but for the last one you go through the whole structure. And depending on how you implement the SRAM table — maybe you keep a pointer to the last position and search circularly, or you always search from zero — depending on the detailed implementation, it may differ. But I cannot explain why there is a dip here.
It shouldn't be like that.
But they published it — that is our culture.
Okay, so you guys — I know you will have this temptation. In some term projects you will have numbers you cannot interpret, and some dip happens.
This is an exercise: just be honest, brutally honest. Yeah, I understand, the temptation is huge, right?
But I couldn't give in, because I had spent many nights trying to find the reason, and I trusted that I had implemented my scheme correctly. I did everything I could, but I couldn't find the reason.
Yeah. So maybe, somehow, when we change to two-way, it always hits the earlier index, right? If not, then I don't know. Because we are not measuring one access — it's a sequence of memory accesses, so we have no control over the order the sequence comes in, right?
But in general, you can see that as the number of ways increases, the difference is bigger when you have a small cache; with a big cache, the hit rate is already high enough, so there is not much difference between the higher associativities.
You can conclude it that way.
Thank you very much.
Isn't our architecture field interesting? But now I kind of doubt whether people still do that or not. Let me give you one question.
The graph here shows set-associative caches, right? So, say I want to place a victim cache on it — where exactly would you anticipate it landing? Would it be near one-way?
So, with a victim cache: let's say I have a very small, one-way cache. This is the hit time, right? If I measure miss rate, the miss rate will be high; but by having a victim cache, the hit rate of the victim cache is effectively added to the hit rate of the first-level cache, because it's an OR function, right? Either you have the data in level 1, or the victim cache hits — you can search both at the same time, in parallel.
So your access time is still fixed, right? Because you made the victim cache small enough that within the L1 hit time you can also search the victim cache, okay?
So, in terms of miss rate, yes, you can reduce the miss rate with a big cache, but you will increase the hit time of the cache. So, again, let me repeat: the level-1 cache should operate within the CPU clock cycle time. Why? Do you remember when we learned the pipeline architecture? Memory access is actually one of the pipeline stages, right? Each stage operates at the CPU clock speed.
So we are hoping for a 90-something percent hit rate — we're going to have the block in the cache, so the pipeline can move on beautifully without stalling any stages.
This figure shows the relative energy consumed as the L1 size and associativity change.
As you can expect, a bigger cache consumes more power — more energy per read and write.
And higher associativity consumes more, because you have more tag-matching units and bigger MUXes, right? So it's an expensive design when you have a big, highly associative cache.
The other way to improve hit time is way prediction: if you have to use a set-associative cache, you can preset the MUX for way selection through prediction.
A misprediction, yes, gives a longer hit time, but if the prediction rate is high enough, you save time overall.
Think about the motivation of caches — why do caches work? Because we see temporal and spatial locality. Temporal locality means the one you just referenced will be referenced again soon, right? So you can simply keep a tag for the most recently used way and use it as the prediction. With this kind of simple prediction, for two-way over 90% of predictions were correct, and for four-way over 80%. That's not bad, right, given its simplicity.
But how would you predict? Do you see? Say you have a two-way cache — let's start with two-way.
What would the prediction be? You just need a flag, right? Which way do you predict?
The first one, always? The same one as last time — yes, the one just referenced, right? So, okay, this one was tag-matched, and you set the flag to that way.
Next time, if the flag says way 1, before tag matching even finishes, you supply data from way 1. During tag matching, if it turns out it is not way 1, then you correct it, okay? But over 90% of the time the prediction is right.
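The flag just described is an MRU (most-recently-used) marker per set; here is a minimal Python sketch of the idea (illustrative, not a real hardware design — the set count is an assumption):

```python
class WayPredictor:
    """MRU-based way prediction for a set-associative cache (sketch)."""
    def __init__(self, n_sets):
        self.mru = [0] * n_sets  # predicted way per set, initially way 0

    def predict(self, index):
        # Supply data from this way before tag matching finishes.
        return self.mru[index]

    def resolve(self, index, actual_way):
        # After tag matching: was the early supply correct?
        correct = (self.mru[index] == actual_way)
        self.mru[index] = actual_way  # remember the most recently used way
        return correct

wp = WayPredictor(n_sets=4)
print(wp.resolve(2, actual_way=1))  # -> False (it predicted way 0)
print(wp.predict(2))                # -> 1 (prediction now follows the MRU way)
```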
It's 4 already, but, okay, I wanna do this…
And the i-cache — the instruction cache — actually has higher prediction accuracy than the data cache.
These ideas were actually used, okay? Used in the MIPS R10000 in the mid-90s and in the ARM Cortex-A8 — commercial products are using it.
This idea was also extended to predict the block as well: not only way selection, it also predicts the block.
But that increases the misprediction penalty if you have block prediction, too.
So we used to have a lot of papers on this way prediction, and prediction schemes in general, 10 or 20 years ago.
Okay, so what shall we do with the victim cache problem? Your quiz 21.
I meant to go… well, can I go?
What do you want?
This still, let's finish one thing.
Cute.
Oh yeah, this one.
Honestly, yeah, no.
Okay.
So, same sequence, okay? Same sequence. Let me just briefly draw it.
Okay, so it's direct-mapped, and the victim cache has two entries. For the victim cache I don't have to specify a way, because a victim cache is fully associative, okay?
Usually, when we draw a fully associative cache, we put it this way, in one row, because there is no indexing, okay? And the direct-mapped cache — how many blocks does it have? Tell me, quickly.
Four, okay, so you have four.
Alright?
So, in this address, what is the block offset? What is the block size? One word? One word, so 2 bits, right? So 2 bits are the block offset. And for the victim cache, all the remaining bits are the tag. Can you see that? The rest is the tag — whereas for the direct-mapped first-level cache, 2 bits are the cache index and the rest is the tag. Okay?
All right? So the first one, of course, will be a miss, right? Where does it go? It's 4429: 9 is 1001, so the offset is 01 and the index is 10. It misses, and tag 442 goes into set 10. Okay?
Then 4419: 9 is 1001 again — offset 01, index 10 — so it goes to the same set. There is something there, but the tag is 442, not 441: the tag doesn't match. The 442 block is replaced — look at this — it goes into the victim cache, and 441 takes its place. The 4419 access is also a miss.
Okay?
And then the next one is 441B. B is — what is it — 1011, right? Okay, so it goes to the same set, index 10, and you see tag 441 there. Same as my tag: hit.
Okay?
And then 442B — what happens? Tell me.
It hits in the victim cache, doesn't it? So 441 should go into the victim cache, and 442 is brought back into the main cache, and so on, okay? So this is how it works. You can see how the victim cache helps: whenever a conflict miss happens, it helps.
Yeah, yes — it doesn't count as a miss. You can write down "hit in victim cache," because you search the two in parallel.
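The trace above can be checked with a small simulation (a sketch; the address sequence 4429, 4419, 441B, 442B is my reading of the board example — one-word blocks, a 4-block direct-mapped L1, and a 2-entry fully associative victim cache searched in parallel):

```python
from collections import deque

def simulate(addrs, n_sets=4, victim_entries=2):
    """Direct-mapped L1 (one-word blocks) plus a small fully associative
    victim cache searched in parallel with L1 (sketch, FIFO replacement)."""
    main = [None] * n_sets                 # tag stored in each set
    victim = deque(maxlen=victim_entries)  # (tag, index) pairs
    log = []
    for a in addrs:
        index = (a >> 2) % n_sets          # 2 offset bits, then 2 index bits
        tag = a >> 4
        if main[index] == tag:
            log.append('hit')
        elif (tag, index) in victim:
            victim.remove((tag, index))    # victim hit: swap the two blocks
            victim.append((main[index], index))
            main[index] = tag
            log.append('victim hit')
        else:
            if main[index] is not None:    # evicted block becomes the victim
                victim.append((main[index], index))
            main[index] = tag
            log.append('miss')
    return log

print(simulate([0x4429, 0x4419, 0x441B, 0x442B]))
# -> ['miss', 'miss', 'hit', 'victim hit']
```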
So the point of the victim cache is: you keep the first-level cache small, so it takes less time, and you effectively increase the associativity.
Because it says 2 and 2. So, okay, say 421B — you follow the same steps. Say you write 4219: 9 is 1001, so the low 2 bits, 01, are the offset; the next 2 bits, 10, are the index — you go here.
The question was about how the victim cache works behind the scenes. It's not one entry — there are two — but it's fully associative. If you draw the victim cache this way, it looks like it has index 0 and 1, but it doesn't: you access the victim cache without any indexing. Sorry, that's why I drew it like this.
Yeah, you can select that paper, but you are supposed to implement it in GC, so that you can study it. Yeah. There are a lot of cache placement papers…
Yeah, I guess I'll hold that.
One of them gets over there.
Or do that weights?
Store all your notes on, like, Google Drive?
Nov 3:
I understand that.
So, he, decided to work with sophistication.
Excuse me. Hey, good afternoon!
Oh, did I share?
Yeah, yeah, go ahead.
Can I start class?
You all got your midterms checked out, right?
So the average is 60, actually a little bit lower than what I expected. Usually I try to have the average around 70. It seems the assumptions I changed took longer than usual, so I will keep that in mind when I make up the final exam, and I do curve, so…
Don't worry too much — it's not the end of the world — but if your midterm is 10 or 20, then, you know, I see that maybe you didn't prepare, or even if you prepared, you didn't get it at all. For that range of scores, we need to think about it. I need to talk to you, okay? Other than that —
I'm not saying you can get an A with any score, but for a graduate-level course, you know the protocol, right?
In my first year, I didn't know that giving a C at the graduate level is really bad, because then you can't get GA positions and whatever.
That's my norm: I try to see whether you have the knowledge to pass this class. Passing this class means a B, okay?
Alright? But if you look at the grade distribution of my course, I'm, I think, a harsh one — I have given D's and F's, okay? If you think, "oh, EJ seems nice, I will just coast" — no. If I believe you don't have the fundamental knowledge of computer architecture, I cannot let you pass. I do give C, D, and F, okay?
So you need to think about it, looking at your midterm, and plan for the term project.
Some of you sent me emails — when you email me, tell me your midterm grade, okay? Then I have a better idea.
As for extra-credit opportunities, you have enough, because if you look at the weights, the midterm is — how much? Only 25, okay? You have homework, the term project, and the final, so you have enough opportunities to make up for it.
So, I confess, this midterm's average is lower than normal — I told you, right? I will take that into account when I make up the final exam and when I do the curve for the final grade.
Does that help? I felt bad, you know — all these students having to go through the weekend without talking to me, worrying — but that is what happens. It happens, okay, it happens.
Some years I do know why it happened. Pranati gave you an idea, right? The GAP — those notations you should study. I even told you, but it takes longer than the normal time, so that's, I think, the main part. And then Tomasulo and hardware speculation — it's the same kind of question we did, but the assumptions changed, so it requires a lot of thought. It was like that.
The final will be similar. My questions are very straightforward: if I think a concept is one you should know, it will be on the final. And "should know" means you really need to understand it, right? You need to practice with different assumptions, different circumstances.
This is my style. Don't say in an interview, "EJ covered only a few topics" — no. I make sure to cover the very important topics you should know from this course, very well. That's my goal, okay? So let's not waste our energy on other things.
Any question or concern you want to share?
Okay, we just talked about the midterm: the average was lower than other years, so I will take that into account when I do the curving and prepare the final exam. For the final exam we have two hours, and as long as the room is available, I will give you more time, okay? You won't have time pressure — a test shouldn't be about time pressure. I'm trying, okay? But for the midterm we couldn't help it.
Any questions? Any questions about the term project? Some of you sent me emails about paper selection. You can also talk to the TA. Any paper within the last three years is fine, especially if you choose ISCA, MICRO, or HPCA. Yes.
And if you have your own research topic going on, just check with me.
And any other cache replacement policy paper within three years is fine. You can do that.
Okay, so — you didn't have any problems with quiz 1221, right? Well, it's past the deadline, so you were supposed to submit it, right? You know how it works.
So, every quiz question — treat it as a potential final exam question. This will be the type: I give a sequence of memory accesses, I choose one of the techniques you learn here, and I ask you about it in the context of the cache. So you should know how each one works, okay?
All right, so let's begin this week. I think this week — shortened earlier by the midterm — is short. I hope to cover up to here, memory. Then from next week we can start GPUs.
A hot topic, right?
Then — someone visited Korea and had fried chicken with beer. Actually, at MICRO we had fried chicken with beer; Korean fried chicken is very famous nowadays.
Anyway, we will start GPUs hopefully next week, or the middle of next week, and then the last topic will be CMP — chip multiprocessors, general-purpose computing on one chip.
So that will be it. The later topics are the most relevant to current trends, so I may want to spend more time introducing current research activities and trends, okay?
However, in order to do that, you need to march along with me. If, when I look at you, you look lost, then I need to spend more time discussing the fundamentals, and we won't have time to talk about the advanced things, okay? So it's up to you. You really need to be on top of things. I know you are busy with homework 4, and then you will be really busy with your term project.
So I try to give a plus to a team that puts in extra effort, if I like your term project, okay?
Don't take advantage of that. And those things you can actually put on your resume.
Think about it: you choose a top paper from recent years, and if you understand it inside and out, it will show during interviews — if you interview with NVIDIA, AMD, Intel, anywhere. Microsoft, Google — they all have hardware divisions, and they are doing really well, right?
AI cannot be successful without hardware advances. You know that, right?
So, let's go.
Since you did quiz 21, I will start with prefetching and HBM.
Did you search for HBM in MICRO? I asked you to do that, right? How many papers can you find on HBM?
So then you asked me about an extra chance to make it up. Okay.
All right, so there was a paper where they want to use HBM to support the sparseness of machine learning algorithms.
Okay, so a machine learning application is nothing but a heavily memory-intensive application — memory is important. So nowadays all accelerators — not just CPUs and GPUs, but accelerators, even GPUs — want to have memory nearby, okay? We talked about it, right?
This is how they used HBM in the very early years. There are a couple of papers using it, and you can get some ideas from them.
With this set of slides, we will talk about the last set of advanced optimization techniques, such as prefetching and using HBM as the last-level cache.
There are two ways of doing prefetching: one through hardware, the second in software at compilation time. Let's look at hardware prefetching. The simplest hardware prefetch is to fetch two blocks on a miss — you bring one more streamed block for each demand miss.
Think about how this is different from simply making the block size double.
When we prefetch with two streamed blocks, the cache organization still uses the smaller block, so you can have more blocks compared to a double-block-size system.
Do you understand this? So if I give a question — describe hardware prefetching, discuss prefetching — then your first sentence should be: for prefetching there are two ways, hardware and software. This is about hardware, and the simplest version.
You can come up with different hardware prefetchers, okay? There are papers that keep tables indexed by the PC. Instruction fetch is mostly sequential, right, so you can prefetch ahead. But what about a jump — a call to a subroutine, a function call? Then you record that call, and whenever your PC gets there, you know where this code segment jumped last time, right? So you can predict which code segment comes next. That's a more complicated prefetcher we can think of, but in the class textbook they only talk about streaming. Streaming means when you have one demand miss — let's say on block number 1001 — then in addition to bringing block 1001, you also bring the next one, 1010: two blocks together.
So in the slide, I asked how this is different from doubling the block size itself. How is it different? Can you give an example?
Do you understand? With prefetching, whenever I demand one block, the second block comes along with it, right? In the sense of occupying positions in the cache, it's the same — but what is different?
If you have a block that's twice as large, it has to be evicted together. Yeah, it will be evicted together. So look at this. Remember, when we prefetch, it's with the hope that the next block will be used — there is no guarantee, right?
So what if the block I just missed on has some temporal locality — I use that variable over and over, and it has nothing to do with the neighboring data? Then this block will be kept in my cache. Whereas the other one, brought in with hope but never referenced, can be kicked out on its own as time goes by, right? However, if you fuse the two into one big block, what happens? Even if you only use a small word in the block, you're wasting cache space. That's the problem with a big block size.
That's why you want a proper block size, okay? But because of spatial locality, you still want to do prefetch streaming like this.
As you can see here, on the SPEC integer benchmarks the improvement is around 1.3x, and on the SPEC floating-point benchmarks some programs get almost 2x performance improvement, because of program regularity. In floating point you do a lot of multiplications and array operations, so you get better spatial locality. That's the reason you can get higher performance with prefetching.
Now, how about AI applications? What do you expect, in terms of miss rate? Would they behave like integer or floating point?
First of all, AI benchmarks are floating point. And what do they do? Matrix computation, can you see that? So you will see a lot of locality.
Okay. So the hit rate will be really high.
And the other question: do you really need a cache?
Nope.
No? Because it's predetermined — all the data accesses are predetermined. And it's not that each datum is used only once. Say you do matrix-matrix multiplication. What happens? You take one row and one column, right? That row will be used as many times as there are columns. So you have locality, and it's not random. With the SPEC benchmarks or other applications, we don't know how many more times data will be used; but here we know exactly — this row will be used this many times.
Okay?
And that is why you see a lot of papers — I think for your term project you went to ISCA, MICRO, HPCA. There are a lot of tensor things, all about this, actually.
Sparseness means the matrix has a lot of zeros. Okay, so we want to condense, compress.
And when we do this, we know how many times, and in what data flow, each row of data will be needed — we know exactly when — so we want to move it, or place it in the local memory, at exactly the right time.
Nowadays everybody works on this, okay?
So if you understand this part well, you have enough knowledge to explore the machine-learning-application side, okay? Everybody works on machine learning, so you should do it.
…to get higher performance with prefetching.
Another way of prefetching is at compilation time.
When you see a load instruction, you know how long it takes on average, so you can insert a prefetch instruction with the same address before the data is needed.
Okay, so understanding that — what is the disadvantage of this software
prefetch?
Did you hear what he said?
After you compile, you have a load, LW, right? You know that if it misses, it'll take,
let's say, 10 cycles. What do you do?
You add an additional prefetch instruction with the same address, located about 10
instructions ahead.
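The scheme just described can be sketched with the GCC/Clang `__builtin_prefetch` builtin — a minimal sketch in which the distance of 16 iterations is an assumed tuning parameter standing in for "miss latency ahead":

```c
#include <stddef.h>

// Software prefetching as a compiler might emit it: alongside the demand
// load of a[i], issue a prefetch for the element needed some iterations
// from now. PF_DIST is a hypothetical knob ("miss latency / loop body
// time"); the extra instruction per iteration is exactly the code-size
// cost discussed in the lecture.
#define PF_DIST 16

double sum_with_prefetch(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 1);
        sum += a[i];
    }
    return sum;
}
```

Note that `__builtin_prefetch` compiles to a hint instruction that cannot fault, which matches the exception-free requirement mentioned below.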
Okay? That's the idea. What's the disadvantage of doing that?
The instruction count, IC, grows, right? A lot of the time with a software approach you
cannot avoid the code size increase.
So that's the main disadvantage. Why does it matter?
If your code size grows, you spend more time and more energy fetching
instructions than doing real work.
So, a prefetch shouldn't cause any exceptions, which takes some extra effort to
guarantee. When you prefetch data, you can either prefetch into a
register directly, or prefetch into the cache.
Okay, so, here.
When you prefetch — again, prefetch means it's not a demand fetch, right?
At compilation time, you make an extra instruction: instead of only a load, you have a
prefetch. That's a slightly different story, because there it's for sure you need the
data, right?
But in hardware prefetch, you are hoping this data will be used. Can you see that? You
are hoping this access would have missed, so the hardware fetches it ahead of time.
So there are — as the video just said — two ways to place the data. You can have an
extra buffer, a prefetch buffer, okay?
So, the first time data is prefetched, it will be in the buffer.
The other way, you just put it in the cache.
So what are the pros and cons?
Always try to think critically.
So, yeah.
A prefetch buffer is extra hardware. In my group — like in an earlier year, one of our
students wanted to do prefetching: we made a table, we observed the regularity, so every
time a function was called we would prefetch, and for those we used a prefetch buffer.
If you do that, you need to give a hardware overhead analysis.
So: the buffer requirement, how much energy, how much extra area will be required,
right?
So what is the disadvantage of putting prefetched data into the cache directly?
It interferes with demand accesses.
You are replacing existing data by putting in new data, right? Those existing data were
brought in by demand misses, but you have to replace them, okay? We call it
cache pollution, okay? Write that down: cache pollution. The prefetcher's disadvantage
is cache pollution.
Because with prefetching you fetch data hoping — it's speculation, again — oh, this
will be used. But if it actually isn't, or it will be used far in the future, then you
brought in data and kicked out the data which is needed right now —
maybe after one cycle. Can you see that? Because of a conflict miss — they share the
same set — the cache gets
polluted. So cache pollution is the classic problem we want to prevent.
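The pollution scenario can be made concrete with a toy direct-mapped cache simulation (all parameters here are made up): a prefetched block that maps to the same set evicts the block the demand stream needs next.

```c
#include <stdbool.h>

// Toy direct-mapped cache: SETS sets, one block (identified by its block
// number) per set. Returns true on hit; on miss, installs the block.
// A prefetch uses the same install path -- which is exactly how it can
// pollute the cache.
#define SETS 4
static long cache_tag[SETS];
static bool cache_valid[SETS];

bool access_block(long block) {
    int set = (int)(block % SETS);
    if (cache_valid[set] && cache_tag[set] == block)
        return true;                 // hit
    cache_valid[set] = true;         // miss: evict whatever was here
    cache_tag[set] = block;
    return false;
}
```

Accessing block 0 twice gives a miss then a hit; but "prefetching" block 4 in between (4 mod 4 = set 0) evicts block 0, so the next demand access to block 0 misses again — that extra miss is the pollution.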
Out of the two options, when we prefetch into the cache, we worry about cache pollution,
because the prefetched data may kick out data which will be used soon.
We'll discuss that later.
Compiler prefetch at compile time is often combined with loop unrolling and software
pipelining.
As the last advanced optimization technique, let me introduce a method that uses high
bandwidth memory, HBM, as a cache for DRAM to extend the memory hierarchy.
Most general-purpose processors in servers will likely want more memory than can be
packaged with the HBM package. Remember, HBM
is good for bandwidth but limited in capacity compared to DRAM. So it has been
proposed that the in-package DRAM be used to build massive L4 caches, with
upcoming technologies ranging from 128 MB to 1 GB.
Okay, so for the rest, I think you can just listen. Why?
I don't see many successful follow-up works from these two papers.
Okay, so this is the sad thing in our community: we move so quickly.
As soon as the HBM idea was introduced — by Samsung — people jumped in. And our
architecture people mostly used to work on
caches.
Right? So you all work on caches, right?
So they immediately came up with: oh, this HBM can be used as the
last level of cache, and then we can use computation for tag matching, okay?
These are the ideas, okay? But I don't see any successful work
following them, okay? Which means
that idea is gone, right? But ideas keep coming back, as I will introduce.
So, if you search for Zhong Hou An from SNU — what he proposed is…
yeah, let me explain, because again it's related to machine learning.
So, if you look at
the kernel part of the code — we call this kind of loop a kernel — you have a big
array, and the computation is simply adding a constant. But let's say it's a huge
amount of data, and the memory has some compute capability inside.
Okay? Then — what do we usually do?
Look at this. If you have this code — you know, this kernel — we have a branch, and you
fetch every instruction, and then you fetch the data in the memory stage, and then add,
put it in the register, and then write it back to memory. Can you see that?
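The kernel being described — reconstructed here as a minimal sketch — is a single streaming pass with trivial compute: per element, one load, one add, one store, and no reuse for a cache to exploit.

```c
#include <stddef.h>

// Offload-candidate kernel: each element is fetched once, bumped by a
// constant, and written back once. On a CPU this drags the whole huge
// array through the memory hierarchy; near-data processing would do
// the add where the data lives.
void add_constant(double *a, size_t n, double c) {
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + c;
}
```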
So it's so inefficient if you have big data like that. So what he proposed, okay:
offload these things to the HBM. The HBM has a small compute capability, so this will
be done there. Okay, so what are the pros and cons?
It's an ISCA paper. Of course, it's a great idea. And I introduced it because there are
so many success stories, you know, right?
The performance benefit will be huge.
What are the cons?
This is aligned with the trend — now it's coming back again: near-data
processing.
In the old days, 20, 30 years ago, people talked about processing in memory.
Okay? Actually, we want to do that, right?
You do the processing where the data is. You don't want to bring it to the CPU. In the
library example I gave: the table is your CPU,
and every time you access memory, you stand up, go to a bookshelf, bring
the book, sit down, work on it, and bring the book back to the bookshelf, right?
That's so inefficient if your workload is — say your professor asks you to
get all the book serial numbers for topic A. Then you go to the bookshelf to find
topic A, you take a picture, and that's it, right? You don't
bring the books to the table. Can you see that? That's this example.
So what are the cons?
Come on.
Why has PIM failed
for the last 20, 30 years?
Why?
It's not regular?
What's the title? Which chapter are we on? What are you learning now? What's the
title of this slide? What are we working on?
Okay, memory hierarchy — the key idea of the memory hierarchy is…
Okay, hierarchy, okay, no, no, no, I'm looking for a word!
You need an additional processing unit in memory? Yeah, but we have those nowadays.
Processing units are so cheap nowadays.
Right? They're everywhere!
Okay.
What's the title of this slide, actually?
Okay.
So, okay, we are talking about advanced techniques. Mainly, what are we
discussing?
It's all about… what are these 10 techniques about?
Is it improving the cache? Cache! Okay!
Cache. Alright, let's go with that. Okay, I gave you a hint, okay.
So look at this kernel. If we optimize this kernel's computation — yes, it
would be good for it to happen on the memory side.
Did I show you the rest of the code?
Okay, tell me. I gave you a hint already.
What happens after this for loop? What happened?
The parallelizable code will be accelerated, and they… I gave a hint!
What was it? Cache.
Okay, so why does a cache work? You need to come up with an example of code — why and
how a cache works.
Because of temporal locality. So you need code with temporal locality, or — since it's
a big cache and a big array — maybe some code with
spatial locality, right?
Well, I was gonna say: because you need to move the data back to your
processor if you need to work on it again.
Like, you offload it, but then you need to bring it back. So what I'm
saying here: if this is the only operation you do on the array,
it's okay — you just do it there. But what about when the updated array is used here
for another operation?
You have locality, right?
So when you look at a bigger scope of the program: oh, maybe this would be better
executed on the CPU side, because it requires more computation. Can you see that?
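The "bigger scope" point can be sketched as two loops over the same array (names hypothetical): if only the first loop is offloaded to memory, the updated array must come right back to the CPU for the second loop, so the offload decision has to weigh that reuse.

```c
#include <stddef.h>

// Loop 1: the offloadable streaming update (no reuse by itself).
// Loop 2: a CPU-side consumer that reuses every updated element.
// If loop 2 exists, the data comes back to the CPU anyway, so it may be
// better to keep loop 1 on the CPU and exploit the locality between them.
double update_then_reduce(double *a, size_t n, double c) {
    for (size_t i = 0; i < n; i++)      // candidate for in-memory compute
        a[i] = a[i] + c;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)      // CPU-side reuse of a[]
        sum += a[i] * a[i];
    return sum;
}
```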
Okay, that's the…
kinds of processing in memory, so you can search the name, processing in memory,
near data processing, all great ideas. We strongly believe, as a computer
architecture, we thought it would be coming, it will be coming, it will be coming,
and then we had a lot of papers, a lot of academic work.
But we couldn't see that coming, but now it's coming back. Why?
AI applications actually So, our ISCA paper… I will show you this, let me delete,
so…
Let me delete this part, okay — because then you can compare this kernel
to an AI kernel. AI boils down to a
common pattern. So in AI, what do you have?
Weights multiplied by inputs, then you do the sigma — the sum — and then the output,
right?
Do you see?
So, we had a paper on this, okay?
Here, if your computation touches one piece of data, you have HBM —
okay, a big HBM, let's say the memory banks are distributed. This happens in one place,
so you can do it in the compute unit nearby, okay? However, when you have a
big memory pool, and the W and the I — let's say W is here, I is here —
what shall we do?
Okay?
So still, we showed: if we identify this kernel, and then change the programming
model, and then compile — when you work like this, you need to propose a new
instruction, too.
So whenever we have this, it won't be a normal multiplication. It will be a special
kind of multiplication that we
tag, and it should happen on the way, okay?
Which means we won't bring W and I to the CPU.
The CPU is far away here, so this would normally be compiled to load W, load
I, and then the multiplication, right? So instead we have a special kind
of load that tells: okay, I need this W,
okay, and
it should be combined with I. So, what we found: we can build a dynamic tree
whenever we send that special instruction. We can find the best place to compute. So
the W·I partial sums are
made there, and only those partial sums are sent toward the CPU.
Okay — and what if part of W·I is at bank B, some other part of W·I is here, and they
combine on the way, like that?
Okay, so a reduction happens — we call it reduction. So instead of 1000 W's and
1000 I's sent to the CPU,
we try to find the W and I and reduce when they meet in the middle, okay?
So we proposed active routing. Router — because when we have multiple
nodes, we have routers to forward requests and replies, so you can do the reductions
on the way.
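The reduction idea can be sketched in miniature (the "banks" and the merge step here are hypothetical stand-ins for the active-routing hardware): each bank forms its local W·I partial sums, and only partial sums travel toward the CPU, merging on the way.

```c
#include <stddef.h>

// Dot product of w and x distributed over NBANKS memory banks.
// Each bank reduces its own slice to one partial sum; the "network"
// then merges the partial sums, so the CPU receives a single value
// instead of 2*n operands.
#define NBANKS 4

double distributed_dot(const double *w, const double *x, size_t n) {
    double partial[NBANKS] = {0};
    for (size_t i = 0; i < n; i++)          // local multiply in each bank
        partial[i % NBANKS] += w[i] * x[i];
    double merged = 0.0;                    // in-network reduction
    for (int b = 0; b < NBANKS; b++)
        merged += partial[b];
    return merged;
}
```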
And then, based on this, a lot of people came up with similar ideas. And these will be
there, okay? I believe when we have machine learning applications and an HBM memory
pool, or a DDR network,
you will do it: instead of bringing all the big matrices to the CPU or GPU,
you will have an accelerator combined with the memory cells.
Are these CPU cores, the boxes here? Oh, yeah — so in our study at the
time, we had a CPU, and — you can imagine the memory pool — each memory
bank is
connected with a router, because we have a network of memory.
Okay, and then all these arrays are distributed, so it's all spread out. So when we send
the request,
and you record that request, you build a dynamic tree; then wherever the W and I
meet each other, that is a reduction point. You do the multiplication there, instead of
sending the W and I back to the CPU.
And the values of the
weights, or whatever they are — are they placed on the memory banks by the compiler, or
just placed? Okay, so we will talk about this. Okay, so when we have big
data, right, the compiler does
compile and link, so it determines the location, and then the operating system provides
the translation, right?
So the mapping between physical location and virtual address happens in the operating
system. So, for this work, we didn't
touch that part. Then I talked to NVIDIA people, and the NVIDIA people said
mapping is the most important thing. They found that if you have a very good way of
mapping, you can reduce this kind of activity, so that is later work we tried to
do.
So, as you said, that in-memory processing — the processor there is
actually something like a GPU or something like that?
Yeah, you can think of it as a customized, simple processing unit. So, do
you remember when I introduced HBM and HMC? The memory cell
technology is not compatible with the logic dies, right? So HBM came
up with a way to put them in the same package,
through an interposer. So you have the data nearby, and while you read, you can
compute, right? So you can use it. But that is only true when you
have something like this work —
oh, I deleted it already: a[i] = a[i] + c. That computation happens in one place. But if
your computation requires multiple pieces of data, then the data is spread out, so a
reduction at dynamic time should be implemented. Actually, it's similar —
when I worked on this: Blue Gene — do you know the IBM Blue Gene? It used to be number
one, the highest-performance computer in the world. Is Blue Gene still number one?
Anyone from
supercomputing?
The Earth Simulator was number one, in Japan. Right now it's El Capitan. Okay. So, so
Blue Gene was the number one, and Blue Gene has this
reduction — global reduction — but their tree is static.
So when you have all this computation, they distribute it, and then, you know,
they build a tree, but it's a static tree. So what we propose: these data
locations can be anywhere, so we build a dynamic tree, and only where they merge,
at that point, we
reduce, and then send the reduced data, the partial data.
So — based on this, there is some other work, okay? So, I think that will be
enough for HBM. I think you can read more.
Let's go to the next one.
So many interesting things going on here.
So, let's talk about how to increase bandwidth. Before you listen: how would you
increase bandwidth?
This is about cache design, right?
So, okay, you're here already more than half a semester. Bandwidth?
What do you do in architecture?
Pipelining! Okay, easiest answer — so you should know, right? All we do is
pipelining. And even in AI accelerator design, everything is pipelining:
tensor parallelism is pipelining, and training is also pipelining. So, hardware people
love pipelining?
Yes.
And the other…
For this set of slides, we're gonna talk about the
three techniques to increase the bandwidth of a memory hierarchy.
Of the 10 advanced optimizations, they try to reduce hit time, or to increase
bandwidth, or to reduce miss penalty and miss rate.
The last group tries to use parallelism to reduce miss penalty and miss rate.
So, with this set, we're gonna talk about the second category of advanced
techniques: pipelined caches, multibanked caches, and non-blocking caches.
First, we can make a pipelined cache, where the indexing, the tag matching, and
supplying the word at the block offset can all be pipelined into several small
stages.
Remember, as we learned with pipelined architectures, pipelining doesn't reduce the
execution time — the latency — of one access.
It may even increase the latency of one access. However, as with a pipelined
architecture, we improve throughput.
The throughput improvement usually shows up as reduced waiting time: we will
have less blocking time due to a busy cache.
Examples of such systems are the Pentium 4 and the Core i7, which take 2 to 4 cycles
per cache access.
As you can see, if we have higher associativity, then tag matching and way selection
take a longer time, so pipelining makes the cache easier to
implement.
So, with a pipelined cache, we can afford high-associativity
caches.
However,
it will increase the branch misprediction penalty, because cache access spans more
pipeline stages.
If a branch was wrong, then it won't be as simple as a one-cycle memory access:
if you have two cycles, then there are two ongoing accesses in flight, so it's not
that easy to flush them.
The second optimization technique to improve bandwidth we are going to discuss is the
multibanked cache.
We organize the cache as independent banks to support
parallel accesses. So, in this picture, you have 4 banks, so you can afford 4
concurrent memory accesses.
This idea has been
exploited in the Intel i7 architecture.
Note that this multibanked cache is different from a set-associative cache,
although they look similar, okay? So let me explain how a multibanked cache works
first, and then I will contrast it with a set-associative cache. So when we have four
banks:
let's say we have a 32-bit memory address, and from the block size we know how many
bits will be used as the block offset, right? Then usually the next bits would be used
as the cache index. But here, when we have a multibanked
cache,
it depends on the number of banks. We have 4 different banks here,
right? So to indicate one of the four, we need 2 bits. So 2 bits in the
middle will be used as the bank ID, okay? They indicate which bank the access should
go to.
And then, in each bank you see in this picture, there are
4 blocks, right? So your cache index will be?
The next 2 bits. The rest will be used as the tag.
Note that here,
according to this 2-bit bank ID, you have a designated bank to go to, okay? And with
the
cache index, you know which row you should read, and then you do tag matching.
Whereas, if these same four tables were used as a four-way set-associative cache, the
placement would be like this: with the address, your offset will be the same, set by the
cache block size, and then —
here, each row you see spans the four tables, right? So 2 bits in the middle
will be used as the cache index, as you already know. That will tell which row you
should read, and then you have four candidates. So you have four
stored tags,
which will be compared with your address tag, and if one of them matches, it's a
hit. You got it? It's different. In a multibanked cache,
with the bank ID, an address can go to only one place, and then while that bank is
serving the current address, other addresses — as long as they go to a different
bank — can be served at the same time. So, parallelism is achieved; that's why we
improve bandwidth with this cache.
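The 4-bank field split can be sketched in code (the 32-byte block size is an assumption for illustration; only the bit positions matter): block offset in the low bits, then 2 bits of bank ID, then 2 bits of cache index, then the tag.

```c
#include <stdint.h>

// Address decomposition for a 4-bank cache with 4 blocks per bank and
// (assumed) 32-byte blocks: | tag | index:2 | bank:2 | offset:5 |.
// Because the bank ID sits in the low-order block bits, consecutive
// blocks interleave across banks 0,1,2,3 -- the source of parallelism.
#define OFFSET_BITS 5
#define BANK_BITS   2
#define INDEX_BITS  2

uint32_t bank_of(uint32_t addr)  { return (addr >> OFFSET_BITS) & 0x3; }
uint32_t index_of(uint32_t addr) { return (addr >> (OFFSET_BITS + BANK_BITS)) & 0x3; }
uint32_t tag_of(uint32_t addr)   { return addr >> (OFFSET_BITS + BANK_BITS + INDEX_BITS); }
```

Block addresses 0, 1, 2, 3 (byte addresses 0, 32, 64, 96) land in banks 0, 1, 2, 3, so a sequential sweep keeps all four banks busy; swapping the index and bank bits would send those four blocks to the same bank.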
Is it clear?
Okay, so with a multibank cache, you have a designated bank. How do you know which? The
bank ID. Do you understand that the bank ID comes lower than the CI, the cache index?
What happens if we change the order?
What happens?
So, when we use the low
2 bits as the bank ID, interleaving happens first, right? So if you look at block
IDs: block 0 goes here, then 1, 2, 3, across the banks, like that.
Right? But if we switch the order —
cache index first, and bank ID later — what happens?
Yeah — blocks 0, 1, 2, 3 would all go into the same bank, top to bottom. So what…
what's wrong with that? Why do we put the bank ID first — interleave first, between
banks? Consecutive blocks? If you access consecutive blocks, you can have
parallelism across the banks. Yes.
So, if blocks 0, 1, 2, 3, 4 are all spread out — and a lot of the time, spatial
locality means you do consecutive accesses to nearby data, right — then these
four banks can all be busy at the same time. You can achieve parallelism, okay?
That's why we have the bank ID first.
Let's look at non-blocking caches. Normally, a cache operates in blocking mode, which
means when you access the cache, other incoming requests will be blocked until the
current one is served.
So…
Let's say the current access misses. That miss request will be relayed to the
lower-level memory hierarchy — L2, L3, or memory — which takes a much longer time.
Meanwhile, you won't allow any later request to come into this cache.
So, the waiting time for this cache will be high.
That's the idea of a non-blocking cache. We operate the cache in a non-blocking
way, which means: when you have a miss, while that missed request is relayed to the
lower-level memory hierarchy, you keep the cache available for later requests. This
is referred to as hit under miss. You allow a hit under one miss, or hits under
multiple misses.
Do you understand this part? So, you can have a cache designed the blocking way, or a
non-blocking cache. So, this used to be homework, okay?
But then a lot of students struggled.
The idea sounds very simple, but the implementation is not easy — so listen to why
it's not easy. It touches every part of a cache
to support this function.
So you can think of it like this: there is an agent, like me, and then someone comes.
If it is a hit, I just serve them, and that's it, right? But if it is a miss — I
cannot do it — I should ask another person to do it, okay?
Then I ask that person to do the service.
Blocking means that while this finishes, I'm waiting. I do not accept any new
requests; I block myself to serve the current
request. Can you see that? That's blocking. So if you look at GSIM, the default
cache is a blocking cache. It won't allow multiple accesses to the
same cache.
However — let's say I want to provide non-blocking service. That means: oh, you
request, and then, oh, I tried to serve it myself, but I couldn't, so I ask the agent —
and then I'm ready to take a new request, right? However, I'm short on memory, so
when the reply comes back from the agent, I need to know — there are multiple
customers, right? —
which customer I should give this data to. Can you see that?
We need registers to remember.
That is called — so, write this down, okay, because whenever I put this on the final
exam, students
have forgotten it, okay? Earlier, I assigned this paper so students knew
this terminology well, but the textbook only talks about it briefly. So,
that is the MSHR.
So —
modern CPUs have MSHRs everywhere. We even proposed using the
unused MSHRs for networking, something like that. There are some other papers,
because the MSHRs are there. MSHR means Miss
Status
Holding
Register.
Okay.
So you can think about it like this. One customer — the customer ID is the PC value,
right? A load
comes, I check, I don't have the data.
I forward this miss to the agent. But I need to record, for when this data comes back —
do you remember? — I need to know which PC this data is for. Can you see that?
So I record, in a register, the PC and the address. Then when the data comes back,
I look at the register status, then supply the data to that PC.
Okay?
So, if you provide hit under one miss —
if you allow this cache to serve hits under one outstanding miss — then you need only
one MSHR, right? Okay?
If you want to provide hits under multiple misses — like, oh, one comes, I can't, I
ask; and then there is another, oh, I don't have it either — then how many outstanding
misses can there be?
It will be limited by the number of MSHRs.
Okay.
And then the other thing — also about the MSHR, in this paper, if you read carefully:
so I sent a request for, say, block number 100. The next request comes, and it is also
to block 100 —
maybe a different word, but in the same block.
Then what do I do? I don't have the block, but it is already on its way, right? So I
only record — append — that PC to that block's entry. Can you see that?
So —
this is a code segment, and this PC
needs data at address 2000, okay? So I send that request to the lower memory.
Meanwhile, the next load comes, and
its PC requests 2004. You had 2000, now 2004 — when I look at the block number, it's
a different word, but in the same block. Then what do I do?
I don't need to send a new request to the lower memory, right?
Because I already sent one. The only thing I do: in the MSHR fields there is a series
of PCs.
So the original first request for 2000 will be recorded, and the next request to the
same block will be recorded as 2004. So when I have the block returned from lower
memory, I supply this block to both
PCs, okay? So that is the function of the MSHR.
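A minimal MSHR file can be sketched as below — the struct fields, sizes, and names are made up for illustration, but the merge behavior is the one just described: a second miss to the same block appends its PC instead of issuing a new request.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_MSHR    2   // limits how many misses can be outstanding
#define MAX_TARGETS 4   // waiting loads (identified by PC) per block

typedef struct {
    bool     valid;
    uint32_t block_addr;                // which block is in flight
    int      ntargets;
    uint32_t target_pc[MAX_TARGETS];    // who to wake when it returns
} MSHR;

static MSHR mshr[NUM_MSHR];

// Record a miss. Returns true if a new request must be sent to the
// lower level, false if it merged into an in-flight entry for the
// same block (secondary miss: just append the PC).
bool mshr_record_miss(uint32_t block_addr, uint32_t pc) {
    for (int i = 0; i < NUM_MSHR; i++)
        if (mshr[i].valid && mshr[i].block_addr == block_addr) {
            if (mshr[i].ntargets < MAX_TARGETS)
                mshr[i].target_pc[mshr[i].ntargets++] = pc;
            return false;               // secondary miss: no new request
        }
    for (int i = 0; i < NUM_MSHR; i++)
        if (!mshr[i].valid) {
            mshr[i] = (MSHR){ true, block_addr, 1, { pc } };
            return true;                // primary miss: request the block
        }
    return true;  // no free MSHR: real hardware would stall the access
}
```

When the block returns, the controller scans for the matching `block_addr`, supplies the data to every recorded PC, and frees the entry — which is why the number of MSHRs bounds the number of hits-under-misses.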
So usually the L2 must support this.
In general, processors can hide the L1 miss penalty, because on an
L1 miss there is a high chance L2 has the data. However, hiding the L2 miss
penalty is really hard, because it takes a long time to get data from memory.
If L2 doesn't have it, the latency is too long, so you want to provide non-blocking
caches for L2.
As you can see in this figure,
we can change the number of outstanding misses under which hits are allowed: 1 miss, 2
misses, and 64 misses. Note that as you allow more outstanding misses,
you need a higher number of registers to keep track of that
outstanding miss information — which is what the MSHRs are. So, let's say we allow two
misses: when the lower-level memory returns with a reply, you need to find the matching
item in the registers, okay? Then you get the offset information. From there, you
can supply the word the CPU is looking for. Note that the lower-level hierarchy won't
guarantee in-order service: let's say it is L2, you have non-blocking,
so it will be relayed to L3. If it misses again, then it will go to memory.
Memory takes a long time, right? So the second one may
have a hit in L3, whereas the first one missed in L3, which
will take much longer. So that's why you need registers to keep track of
all the outstanding misses.
Okay — the replies will arrive out of order, so you need to have these registers.
Okay.
So — let's do the quiz.
Okay.
Two minutes is enough, right? You already have this sequence in binary,
right? Somewhere in your notes.
Right?
So — I think this shows how a set-associative cache is different from a multibank cache.
Awesome.
So click and submit.
Because it's asking us the number of concurrent accesses — shouldn't everything be
the same?
So now — Intel servers are actually multibanked.
So, let's say they have 16 cores; then they have a 16-bank last-level
cache. And based on the bank ID, the location is determined. So with 16 — say a 4×4
mesh —
with the ID, you know where that slice of the last-level cache is.
So we will talk about this with cache coherence later. Yeah, that's two-way
associative.
First.
Sounds good.
What's happening?
Direct-mapped.
I'm actually done.
So, first, we need to
figure out the fields, right? So what is the offset?
How many bits?
Beautiful.
Okay, why?
So — two words, one word?
What is the word size here? 4 bytes, okay; so then 8 bytes per block. So the offset
will be three bits,
right?
All right. Then, it's two-way associative, so how many bits do you need
for the cache index?
What's your total cache size?
4 blocks — and 2-way means your 4 blocks
get cut in half, right? So your cache looks like two
tables, doesn't it?
So the cache index is 0 or 1 — 1 bit of cache index.
And the rest is the tag.
Clear? Okay.
How about B?
You have two banks — two tables, right?
Two banks means the bank ID is
one bit, right?
So you have two banks. Each bank has how many
blocks? Your total is 4 cache blocks, so each one will
have 2. So, again, the cache index here is 1 bit.
Right? So this is the bank ID, this is the cache
index, and then the rest is the tag.
Is the tag size the same? No — different, right?
B's tag is one bit shorter, and B doesn't have flexibility. When you have a longer tag,
a longer tag means more flexibility among blocks going to the same place, doesn't it?
Alright — based on this, do it.
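For the quiz parameters just worked out (4-byte words, 8-byte blocks, 4 blocks total), the two splits can be checked in code — a sketch against which to compare your hand-derivation for an address like 0x442B:

```c
#include <stdint.h>

// (a) two-way set associative, 2 sets: | tag | set:1 | offset:3 |
// (b) two banks, direct-mapped:        | tag | index:1 | bank:1 | offset:3 |
uint32_t sa_set(uint32_t a)   { return (a >> 3) & 1; }
uint32_t sa_tag(uint32_t a)   { return a >> 4; }
uint32_t mb_bank(uint32_t a)  { return (a >> 3) & 1; }
uint32_t mb_index(uint32_t a) { return (a >> 4) & 1; }
uint32_t mb_tag(uint32_t a)   { return a >> 5; }
```

0x442B is binary …0100 0100 0010 1011: set 1 with tag 0x442 in the two-way split; bank 1, index 0, tag 0x221 in the two-bank split — one bit shorter, and with no choice of placement.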
Isn't it easy, right?
Muhammad.
Gotcha, David.
Bank… yes.
Beautiful.
Yeah, it was —
but they didn't tell us the word size.
I think I didn't write that down.
Okay, so 18, it was… so it's not the same. Yeah, yeah.
And because… okay — so, do you want to go over the set-associative one first? We did it
before.
So, set-associative means:
let's say it's two-way — you have two tables, but your total cache size
is fixed, so each one is half.
But between these two tables, you have flexibility, you have freedom: you treat
them together.
Okay.
Hex to binary, okay.
So, 442B: B means 1011, so the index bit is
1 — it goes here.
And then you put the tag through here.
And then the next one is 44 19:
9 is 1001, so it's the same place, but there is an empty way, so you use
this one.
Going here means it tag-matches, right? Okay.
And then 1011 again — the same thing, right?
That's true. Yes.
So, if I do the first four, it'll be miss, miss, hit, hit. Okay, and your set
will look like this. If you swap — 441 here, 442 the other way — it's still fine,
because it's flexible. How about this one?
For 442B, you need to convert this to binary, okay — so it'll be 0010, and B means
1011. So you are using this bit as the bank ID, and this one as the
cache index. Bank ID equal 1 means you go to this table, and then index 0, so you have
44001 there. Can you see that?
For the tag here: 442 is the hexadecimal, partly changed to binary; the last bits are
used as the bank ID and
cache index, so the leftover 001 becomes part of the tag. Okay — so 44 is hexadecimal,
001 is binary; I
just wrote it that way, okay?
How about the next one?
441 9: 1001. So you are using this and
this. So, bank ID equals 1, and index equals 0, so it should be here — 44400. So
it's a miss, miss.
And then the next one, 441B: 1011 —
bank 1, index 1, and then… The hit-miss sequence looks the same, but the contents of
the cache are
different.
Aren't they? Here you don't have any flexibility — each address has a
designated spot to go to — whereas those two in the two-way cache are exchangeable. So
in terms of the hit-miss sequence it's the same, but the final
contents, if I draw them —
the final contents of this will be 44001 here, so 44001 here.
The other one: 442, and then 043. Because there's still 4.
Yeah. So now — four of these had different spots, and now they're in the same set?
It's two ways.
Oh, I put the tags. Okay — so the way I simplify it, for example, look at 442B:
4, 4 — these are hexadecimal, and only this part I need in binary, right?
For my convenience, I only convert this part to binary: it's 0010 and 1011. Then,
following this, this is the block offset,
this will be the bank ID, this will be the
set index, and then the rest — the 44 stays as hexadecimal, and this binary 001
is saved as part of the tag. So that's why I put 44001. It shouldn't be divided by 2.
I don't know why you divided by 2. I'm not changing them right now.
Oh, okay, okay. So you're — oh, yeah — so you're converting to hexadecimal again. Yeah,
yeah. Maybe it can be done that way, right? Okay.
So, let me move on. Any question on this question?
The same with you.
It's B.
It takes 3… A would be equal.
How can we reduce the miss penalty and…
tags. That's why it has, like, 4…
In this set of slides, we're gonna discuss advanced techniques to reduce miss
penalty and miss rate.
So far, we discussed advanced techniques to reduce hit time and to increase
bandwidth. Let me revisit the formula for average memory access time.
As you recall, the average memory access time
of one level of the memory hierarchy will be…
So you guys are still working on hit, miss…
Okay, are you done with it, Chris?
You can hold the discussion and do it after class, okay? So let's go to the next
topic.
Alright.
Beautiful.
Can I have your attention, please?
Let's go, okay. Forget about it, Chris — you can do it later.
I know you love questions, right?
Ask me if you have any questions. Okay? Alright.
Hit time of the current-level cache.
And then, if it is a miss — miss rate —
you need to go through the miss penalty: how long it takes to bring the demanded block
into that level.
So hit time, of course, if you reduce it, that helps to reduce average memory
access time. And
bandwidth, that you cannot see in this formula — but before you get into the cache, there will be waiting time. So if you have increased
the bandwidth, you will reduce the waiting time outside. Okay? Now, we will talk
about how to reduce miss penalty and miss rate.
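The formula being revisited here, AMAT = hit time + miss rate × miss penalty, is easy to sanity-check with a few lines of Python (the numbers below are illustrative, not from the lecture):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: AMAT = hit_time + miss_rate * miss_penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers: 1-cycle L1 hit, 5% miss rate,
# 100-cycle penalty to bring the block from the next level.
print(amat(1, 0.05, 100))  # 6.0
```

Note that miss penalty only contributes in proportion to the miss rate, which is why the two critical-word techniques below can shave it without touching the other two terms.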
Let's discuss
two advanced techniques to reduce the impact of miss penalty. The first one is critical
word first. The other, early restart. Try to differentiate them conceptually.
Okay, so I think I need to stop here. So, how will they differ, going from the names?
Critical word first means… so think about it. When we have a load instruction, LW,
then R1, R2, you calculate the address, right? That is the exact position of the
word you are looking for. However, when we have a memory hierarchy, we translate it
to a block number, right?
So you go to memory, you will get a whole block. However, the data you need…
it can be anywhere in it. So, critical word first means, when you send that address,
memory will read that word first and put it at the front, and then the rest of
the whole block comes after it, okay? So…
How much time will be saved?
Do you recall, in an earlier quiz, we had a narrow bus?
You have a 32-byte data block size; however, your bus is 8 bytes. So if your 8 bytes is always delivered
first, you don't need to wait all four cycles, right? After one cycle, you can
supply the data. So that's how much you can save, okay? How about early restart?
So what does it mean by early restart?
Early restart — so it's the same:
you send the whole block, okay? But then the requested word is in a random place,
right? Then, when it is delivered to the CPU, you need to copy that data
to the cache, right?
So you're writing from the beginning.
Then you reach that location. Then what do you do?
Instead of waiting for the whole block to be written to the cache, when you reach
that position, you supply that data. Okay, that is early restart.
Which one
will be easier to implement?
Early restart. Why?
Because you don't need to change the memory side — can you see that? Critical word first:
the idea is great.
But if you're a CPU designer and you propose this, then, you know, the DRAM people
will laugh at you — it is too much, right? Changing the microarchitecture inside the
memory. So there are pros and cons. We will talk about the latter one more later,
okay? Thank you.
I think the difference between the two is that one of them asks memory to send the
important word first; the other just forwards the requested word as soon as it shows up — the memory just sends the block in order.
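The narrow-bus quiz example can be sketched like this — a minimal Python model, assuming a 32-byte block delivered over an 8-byte bus, one chunk per cycle:

```python
BUS_WIDTH = 8      # bytes delivered per bus cycle (assumed, as in the quiz)
BLOCK_SIZE = 32    # bytes per cache block

def cycles_until_word(offset, critical_word_first):
    """Bus cycles until the requested word can be supplied to the CPU.

    offset: byte offset of the requested word within the block.
    Critical word first: memory sends that word's chunk first (1 cycle).
    Early restart: chunks arrive in order; the CPU restarts as soon as
    the chunk holding the word arrives, instead of waiting all 4 cycles.
    """
    if critical_word_first:
        return 1
    return offset // BUS_WIDTH + 1

print(cycles_until_word(24, True))   # 1
print(cycles_until_word(24, False))  # 4  (word is in the last 8-byte chunk)
print(cycles_until_word(0, False))   # 1  (best case: word is in the first chunk)
```

So critical word first always wins by design, while early restart only wins when the word happens to sit early in the block.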
Nov 5:
Yeah, I just… For allergies, I have two.
What the fuck?
One week.
But it doesn't have an Oscar.
Yes, you, you found me.
Once we get everything.
With this set…
Last from one another, last one there.
C++ movement.
Thank you, thank you.
We have films.
So people sitting here, I don't… I don't get to know you, right? But you're always
sitting here.
Okay, good afternoon, let's begin.
So, as you can see, today we'll be finishing up this chapter, okay? Then from next
week, we're gonna talk about GPU and CPU, then the final. But when is our final
exam?
Okay. And then, when is the last day of lecture? We have a reading day, right?
The 30th.
The 15th is our final exam. When do finals start?
December 2nd?
It was…
Oh, I can check the exact date. The last day of class is December
8th. We still have…
Yeah, enough time. Two weeks — two weeks for GPU and CPU.
All right, so let's finish up the memory hierarchy. We've been talking about
these 10 advanced optimizations, and today we're gonna finish up with
reducing miss rate. We talked about how to reduce miss penalty, roughly. Let's go slow
a little bit, okay?
increase the bandwidth, you will reduce the waiting time outside. Okay? Now, we
will talk about how to reduce miss penalty and miss rate.
Let's discuss two advanced techniques to reduce the impact of this penalty. First
one is critical word first. The other, early restart. Try to differentiate them
conceptually.
The critical word first we already discussed when we discussed memory, right?
Memory — activating a row and…
reading the column with the address takes much longer than reading data sequentially.
So, what it does: instead of reading from the beginning of the whole block, it will
read the critical word first. You send not only the block ID; you need to send the
whole address, the word address you are looking for. So then the memory will go to
that spot first, read it, put it at the front of the packet, and then the rest of
the block follows it.
So among these two — they are similar, right? But…
in terms of latency, and in terms of the complication of implementing
these techniques, think about how they differ.
The idea is simple. When you have an LW, there is a critical
word you are looking for, but memory will supply the whole block, right? So you are
reading the whole block, you are waiting until the whole block is transferred to the CPU
side, and then you wait until the whole block is written to the cache.
Then, using the offset, you read that word from the block and supply it to the CPU, right? Do
we really need to do that?
Okay, so critical word first: memory itself will put that word — the missing word —
at the front of the package when it delivers it, okay?
So…
Okay? And early restart is the same thing. You have an address, but
then you will read the whole block.
Okay — the block that has that word will be shipped out and delivered to the CPU,
and then, when you write it to the cache, you only wait until that position is reached.
If it is in the middle, you wait until half of the
block is written; then, when you reach the position of the critical word you are
looking for, instead of waiting until the rest of the block
arrives, you supply that word to the CPU. It shortens the time
a little bit. The worst case is…
you need to wait until the whole block arrives, right?
But in the best case, you don't have to wait at all, if it is the first word. So on average,
the waiting time will be half the block-writing time, okay?
That's the difference, but it only requires a CPU-side change. You don't need to ask
the memory vendor to change the interface, okay? That's the difference.
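The average-case claim can be checked directly: over the possible chunk positions of the requested word, early restart waits about half the full block transfer time on average (same illustrative 32-byte block over an 8-byte bus as before):

```python
CHUNKS = 4  # 32-byte block over an 8-byte bus: 4 transfer cycles (assumed)

# Early restart: the CPU resumes as soon as the chunk holding the requested
# word arrives, so the wait depends on the word's position in the block.
waits = [pos + 1 for pos in range(CHUNKS)]
avg = sum(waits) / len(waits)
print(avg)  # 2.5 cycles: about half of the full 4-cycle block transfer
```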
When the processor gets the block, since the critical word comes first, while it is still reading, it can
supply the word very quickly.
The other one, early restart, is something you do on the processor side, whereas critical
word first happens on the memory side.
With early restart, there is no change on the memory side. You send the block ID along with
whether it's a read request or a write request. If it is a read request, with the
block ID,
memory will read from the first byte of that block, and then the data will be
shipped out in that order.
Okay? So if the critical word that the CPU needs to read is in the middle, you need to
wait until the
reading point reaches that position.
Usually what happens: you get the whole block from the lower level, and
then you copy it to the upper level, the L1 cache, and then you use the offset to
get the word you are looking for. Early restart doesn't do it that way. When you get the
whole block from the lower level,
then you, right away, initiate reading the critical word part, and
supply it, before you copy the whole block to the L1 cache.
The effectiveness of these strategies depends on the block size — if the block is big,
they make more sense — and on the likelihood of another access to the rest of the
block. If you have a big block,
by spatial locality it is likely that other parts of the block will be
accessed soon,
and a big block takes a long time to transfer. That's when you think of critical word first, or
early restart.
Okay.
But I want to advocate these two for another reason. Can you think of it?
So, think of the other techniques we discussed, like reducing hit time,
right? Or improving miss rate — we will talk about reducing miss rate
through compiler optimizations.
You can reduce miss rate by having a big cache, right?
But then, what is the side effect of having a big cache?
Hit time increases, right? So, in the average memory access time
formula — hit time plus miss rate times miss penalty, three parameters — for those two,
hit time and miss rate, you need to find the sweet spot, right? A trade-off.
How about miss penalty — these two techniques we just discussed?
Do they affect either of the other two parameters, hit time or miss rate?
No, nothing, right? Nothing to do with the hit time, nothing to do with the miss
rate. Do you see? They only reduce miss penalty, okay?
So the saving won't be that dramatic — it's, like, how long
it takes to write a whole block versus half a block.
Right? But still, that much reduction will be reflected in average memory
access time directly, without any implication for the other parameters. Okay? That's a good
thing.
So, yeah, a lot of processors adopt this, okay?
Let's go to the other one.
Merging write buffer.
As we discussed before,
when you have a write request, we have a write buffer between CPU and memory,
so as not to hold up the CPU. When the CPU has a write request, it can put
that request into the write buffer, and then it doesn't have to wait — it can go on to other
instructions to work on.
This write buffer is used with both write-through and write-back policies.
When you store to a block that is already pending in the write buffer —
without write merging, you would allocate another entry, right? But here, you only
update the write buffer. When the block you need to write is already in the
buffer, you update that entry. So the number of
requests in the queue will be reduced.
That's why it reduces stalls due to a full write buffer. Remember, when you have a
write request, you want to put it
in the write buffer. If the write buffer is full, the CPU does nothing but stall, waiting, and keeps
checking until the write buffer becomes available — so we can avoid that time. Actually,
we noticed that this stalling happened a lot with
write-intensive applications.
Note that we cannot apply this to I/O addresses — with memory-mapped I/O, we can't do it,
okay? So this picture shows: if you have one buffer entry — one block — and there are
four different writes, when you merge, it becomes one request from four
requests. So the rate of growth of the write queue will be really, really small with a
merging buffer.
So do you understand what I'm talking about? There is a buffer, okay? There is a
buffer between CPU and memory. Whenever you do a write — write-through or write-back — the write
request will be stored in the write buffer.
Then, since memory is slower, and most memory controllers give a higher
priority to reads, the write requests sit in the buffer for a long time,
okay?
So what happens: in the figure, you see, once you request block 100, then
most likely the next requests will be 104 and 108, because of spatial locality, right?
So instead of handling these individually, one by one, we merge them, okay? We merge,
and then the length of the queue,
okay, will grow less, because instead of having 4 items in the queue, you will
have 1.
Can you see that?
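A toy model of the merging idea (block size and the four write addresses are illustrative, in the spirit of the slide's 100/104/108 figure):

```python
BLOCK = 32  # bytes per write-buffer entry, i.e. one cache block (assumed)

class MergingWriteBuffer:
    """Toy merging write buffer: one entry per block. A write to a block
    that is already pending merges into that entry instead of allocating
    a new one, so the queue grows much more slowly."""
    def __init__(self):
        self.entries = {}  # block base address -> set of written byte offsets

    def write(self, addr):
        base = addr - addr % BLOCK
        self.entries.setdefault(base, set()).add(addr % BLOCK)

buf = MergingWriteBuffer()
for addr in (100, 104, 108, 112):   # four nearby writes, as in the figure
    buf.write(addr)
print(len(buf.entries))  # 1: four requests merged into a single entry
```

Four stores, one buffer entry — a non-merging buffer would have allocated four.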
So, the write buffer — okay, think about it. The CPU, whenever it has a
store — with write-through, whenever you have a write — you write
to the buffer, right? What if the write buffer is full? What does the CPU do?
Nothing but stall and wait until an empty spot
appears, right? So merging helps to improve throughput
by shortening the effective length of the buffer. Can you see that?
So, when a new request comes to the buffer, you check whether there is any request for a consecutive
block number already there. Then you just append it there. You only change the bookkeeping:
the starting block number is this,
and this many more
words belong to it. Why does it work?
I'm not sure if you remember, when we talked about memory: on the memory side, what do we
have in front of the memory array? Do you remember? When a read happens, the address is
decoupled into row and column, but what happens
when we read a row? Instead of reading only that particular block,
what do we do?
The whole row, right? The whole row is written to the row buffer. There is a big
buffer on the memory side.
Okay? So we get a really high hit rate when we have spatial
locality. So this will work well. And in GPUs,
so, since we are getting into GPU architecture — GPUs are this way.
There are a lot of, you know, big array computations, and you have
parallel processing. And they
aim to design a high-throughput system.
Okay.
So then, in a CPU, what happens? When you have a load and it is a miss, then
what? We should wait until the data is brought from memory, and then we can
go on, right? But a GPU — when it has a miss, what does it do? A
memory miss?
It does a context switch, okay? So, with the next set of slides, we will talk
about virtual machines and virtual memory, and that is related to context
switching. So, let's go slow.
Let's hold the discussion, because you need to understand what a context switch is
first.
Okay.
So we briefly discussed virtual memory. The physical memory is fixed,
okay? But, as you already know, we give an illusion to the user that we have a
bigger memory space. Where does the illusion come from? Where is the bigger memory
space?
Your physical memory is fixed.
Where is it?
What is it?
I/O — disk, okay? So there is a disk, right?
You know, terabytes, a big one. So we give an illusion. Why can I give this
illusion?
This is it. Okay, so… I think in every undergraduate and graduate course, I give this
question,
to…
let you understand multiprogramming and the virtual memory thing, okay? Virtual
memory exists because, even when you have one CPU, there are actually multiple
programs going on, okay?
So this is a typical example, and I failed to answer it correctly when I took my
candidacy exam,
20 years ago. Okay, this is the situation.
So one night, let's say 12, I know, it's, already 16 years ago.
Okay, so my daughter, 2 years old, is playing with her toy, and my son, 14 years
old, is working on a
calculus problem.
Okay.
At the same time, they call me: Mom!
Then who
will I serve first? This is the question. That was the question when I took the
candidacy exam for operating systems.
Okay, so you have an I/O-bound job — a process —
and a CPU-computation-bound one.
Which one do you serve first?
That's a great time.
Yeah, so you gave the right answer, right? So hold that — we'll come back to the computation versus I/O-bound
thing later. So on my evening, right, 16 years ago, of course, I will serve
my daughter first.
Right?
Okay.
Because she's cuter… But what she was asking for is this:
Mommy, I need a cookie, okay? The cookie was, you know, on top of the refrigerator, okay?
Whereas my son: Mom, I don't understand this —
okay, a calculus problem.
So, which one do I serve first? You won't get this question wrong anymore,
right? I was wrong. I put… okay.
I/O-bound versus computation — but which one sounds more important
to you?
Computation, right? So I put computation, and then I got the deduction.
So… look at that.
You can imagine Mommy is the CPU, okay? There are two jobs, okay? This is the way we
can provide multiprogramming, okay? Both of them think they have
100% of my attention.
However, I'm the only person, right? The CPU is the only one, but you can run multiple
programs. How?
Do a context switch, right?
So, this question is about scheduling. When both jobs come, which one will you
serve? Yes, the I/O-bound one, because my daughter — once I give her a cookie, it takes less than
1 minute. She will be quiet for, like,
10 to 30 minutes. She will play with her Legos, quiet, right?
But once I sit with my son to explain, whatever, differential equations,
whatever, it takes more than 10 minutes of my full attention. Can you see that?
Okay.
So, which one…
helps to improve the throughput of the system? So think about it: in system design,
we always think
throughput first.
Okay? So serve the short I/O-bound job faster.
So the Linux kernel — the Linux operating system — what does it do? It has queues, okay? The
operating system maintains the run queues.
Then, any time a job wants the CPU, okay, it starts at the top
level, because we don't know whether it is I/O-bound or CPU-bound, okay? So we
start by guessing: okay, I/O-bound. We give it,
like, let's say, 20 cycles of CPU time, and if it still couldn't finish, that means it
requires more CPU time, right? So next time, when it comes back, its priority will be
lower — it goes to a lower queue, okay?
Can you see that? So as you require more and more CPU, you get a different
priority — your priority goes lower and lower. So the CPU always serves the
shortest jobs first, and then goes down, okay?
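The multilevel feedback queue being described can be sketched in a few lines; the quanta, queue count, and job sizes below are made up for illustration:

```python
from collections import deque

# Toy multilevel feedback queue: every job starts in the top queue (assumed
# I/O-bound); a job that uses up its whole time slice is demoted, so
# CPU-hungry jobs drift down to lower-priority queues.
QUANTA = [20, 40, 80]  # time units per slice at each level (illustrative)

def schedule(jobs):
    """jobs: dict of name -> remaining CPU time. Returns completion order."""
    queues = [deque(jobs.items()), deque(), deque()]
    done = []
    while any(queues):
        level = next(i for i, q in enumerate(queues) if q)  # highest non-empty
        name, need = queues[level].popleft()
        if need <= QUANTA[level]:          # finished within the slice
            done.append(name)
        else:                              # used the whole slice: demote
            nxt = min(level + 1, len(queues) - 1)
            queues[nxt].append((name, need - QUANTA[level]))
    return done

# The cookie (short, I/O-bound-like) job finishes before the calculus one.
print(schedule({"cookie": 10, "calculus": 100}))  # ['cookie', 'calculus']
```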
So that's how we do things. Then how about the context switch?
With this context switch — while I'm working on the calculus problem, I can't just stop.
So I usually say: okay, read this part, or do this exercise. Then I serve my
daughter, right? Then he still thinks he has my full attention. And even with my
daughter, I play a little bit, and then — my memory of calculus is
limited;
I studied it, like, 40 years ago — then I pick it back up, right? So that's the context
switch. Whenever a context switch happens, the PC value and the register and memory state are
switched. That's how
the CPU works, okay?
Alright, so let's go back to context switching on GPUs, okay?
So in a CPU, when we say context, it's at the…
process level — we say a process is a life of a program, right? It's a big
program.
So you have global memory, local memory, a stack, like that. But in GPUs, they call
it a thread, okay? They have a thread block, which
is the unit of scheduling. So, a thread block — you can think of it this way. You have a
for loop that goes to 1024, and
the processing unit can handle 16 at a time, so you make groups of 16 — from 0 to 15,
16 to 31 — you make a block, okay? So you run that in parallel, and then it
reaches a load — a memory miss, right?
What does it do?
You context switch to another warp. We call it a warp, okay? So we are hiding
memory latency by
having many, many
thread blocks running at the same time. We continuously context switch. So, the
big difference between CPU and GPU: on the CPU, context switching is handled by the operating system,
okay?
But on the GPU, we do it in hardware. So, you know, NVIDIA did this well from the beginning,
and all of us jumped onto warp schedulers — we had a lot of
papers 20 years ago. I should have bought the stock at that time.
I wrote papers on it — I worked on that, but…
anyway, this is a side story.
Not only me — a lot of us joke about it.
At that time, NSF funded a lot of projects on GPU design. We knew it was big —
why?
The CUDA library. Because of the CUDA library, general-purpose computation could
happen on the GPU. That was the main thing. A lot of people now say, you know,
NVIDIA does well because of AI. No, that came later, okay? Actually, NVIDIA's
GPU was
rising because of the CUDA library.
Anyway, so…
So, the merging buffer — I want to connect it with the most recent topics, in GPUs, too. So, as
you can see, in a GPU, when they have a thread block — a for loop from 0 to 15 — they
will have a miss on element 0…
the first elements — like, 16 of them, right? Think about it. Each one will
generate a write request.
Without merging, you're gonna have 16. Can you see that?
Okay? They have a separate terminology for this: they say coalescing memory
accesses, okay? So whenever they have a miss, they don't send it to the memory — or even the write
buffer — right away.
They wait a little bit of time and merge the other misses from the other
threads, okay? Then they put them together and merge, okay? It's very similar.
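A minimal sketch of coalescing, assuming a 64-byte transaction segment (an assumption for illustration; real GPUs have generation-specific rules): accesses from one warp that fall in the same aligned segment become a single memory transaction:

```python
SEGMENT = 64  # bytes per memory transaction segment (illustrative)

def transactions(addresses):
    """Coalescing sketch: count how many aligned segments one warp's
    accesses touch; each distinct segment costs one transaction."""
    return len({a // SEGMENT for a in addresses})

warp = [t * 4 for t in range(16)]        # 16 threads, consecutive 4-byte ints
print(transactions(warp))                # 1: fully coalesced
scattered = [t * 256 for t in range(16)] # widely strided access
print(transactions(scattered))           # 16: one transaction per thread
```

Same 16 threads, a 16x difference in memory traffic — the same economics as the merging write buffer.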
Folks.
For the write buffer — I mean, when we're not merging the write buffers,
right? Shouldn't the second address, like 108 — should it be aligned to the column,
too?
I mean, when it's stored — was it stored? No, this one:
should we put it here,
right next to it? Should we align it to the next entry, or… Align to the next one,
yeah?
Oh, I'm awesome.
No — this just shows a simple way of handling the write buffer. You can
imagine this is the first entry, the second entry, like that. So it grows for every miss.
But then, when you merge, what you do: when you have a new one, you examine the
address. If it is consecutive, you put it there.
Okay.
So, you know, the write buffer grows very slowly.
Is this gonna work when the addresses are not consecutive?
Well, that's a good question, right? So, let's say you're… actually,
when we talk about GPU applications, it won't happen much.
Right? Right. Yes. But on a CPU — so, first you have a miss for 100, and the next one
is 116; then it's not consecutive, so it will occupy a new entry, right? And then a third,
let's say 108, comes — then you merge. Can you see that?
Then the question is, how long do we wait for merging? No, no, no, no — you don't
wait.
Merging happens only while the request is in the buffer.
So let's say the first one had already been served before the third one arrived — it's gone
from the queue.
Do you know what I mean? So we merge only among the outstanding requests, okay? We
don't wait. We don't wait, okay. Merging happens only while you are
waiting anyway. Why does it happen? Because your memory is much slower.
Okay? We don't add any delay. Some, you know…
there were proposals that are kind of lazy, whatever — you delay a little bit so that you can
do the scheduling better, something like that. You can try that, but this simple merging
buffer idea is that you merge during your waiting time.
So, in this case, the increase in performance specifically comes from
a reduction in the number of commands we're sending to memory? That's true. Yeah,
that's the main reason. Because if the banks are interleaved…
even with a single command, they will be executed there — we only have to send
one request. Yes, yes. So, your store time — remember, the CPU
time is more critical.
Okay? When you write, writing is not on the critical path of execution, right? We
don't care, actually, to be honest, how long it takes.
However — so the CPU, me: whenever a write happens, I throw, you know, my request into the
buffer.
So I have empty hands after I put it in, right?
But if the buffer is full, what do I do?
Nothing — I need to hold that request, and I don't have any structure to hold
it on the CPU side. Do you see what's going on? The CPU cannot do
anything. So finally, at that time, this write miss — this write request —
stalls the CPU.
Okay? That is what we really want to avoid.
Okay, so why do we need to have a buffer at all?
Think about it.
Because writing is slow. Yeah.
So — I'm a very impatient person, okay? When I work, okay, I'm fast, okay?
Then there is another person who is very slow — I won't say who, okay? And then I
prefer a buffer.
Because I cannot wait — I'm done, right?
I cannot wait until he gets my request, okay? It's asynchronous — he's too
slow, and I'm impatient. So what I do: whenever I'm done, I throw it over.
Then, at his speed, he will pick it up
and do it. Can you see that? We need a buffer wherever two systems
with different speeds work together.
Isn't it so? Okay?
So… so I throw, and I throw — let's say this table can hold only 10.
He's very slow.
10. Then for the 11th one, what do I need to do? I need to wait, and then I go crazy, right?
I will start to yell, right?
Hello — hurry up! Can you see that?
That's CPU stalling time. You cannot let the CPU stall, okay?
All right? So, if you have a buffer limited to 10 entries, and you let them merge, you
can effectively hold more than 10 requests, right? So you won't see that stall time as
much.
I gave two good examples you won't forget, right?
Note that the techniques we just discussed — critical word first and the merging write
buffer — belong to the techniques to reduce miss penalty.
Now, we're gonna talk about how to reduce miss rate. Mainly, we are relying on
compiler optimizations. You will be amazed how much compiler optimization can reduce the miss
rate.
Compiler optimizations.
These techniques come with GCC: if you use GCC with the -O optimization option, it does this.
Okay, so when you do matrix computation, whatever — if you give the -O
optimization option, the compiler will do it for you, okay?
This is, I think, very interesting, because if you read nowadays'
computer architecture papers, there is a lot of tensor parallelism, and, you know,
those are matrix operations — actually, machine learning is nothing but matrix computation,
right? So you will see some of these ideas. Actually, these are fundamental things for
machine learning accelerator design.
Earlier, when caches had just been introduced —
1980, just after that — McFarling found how to reduce cache misses through
software, in the compiler.
The results they got are really amazing. For instructions — the code segment — they can
reorder the code segments of the procedures in memory so that they reduce
conflict misses. For example, if procedures A and B run back-to-back, then
you can make a little change in the order of the definitions of the
procedures, and you can reduce conflict misses.
They do this through profiling. Profiling means you run the program beforehand to figure out how
many conflicts happened; then you can change the locations of the code segments and see
which placement helps to reduce conflicts among the instruction code segments.
There are four techniques for data. Again, they are to reduce miss rate, okay? First,
merging arrays — you will see the example. And loop interchange — it will be
really useful when you access array memory in a certain order; you will see
that.
Loop fusion, also: instead of two independent loops, you can have one loop and
overlap the accesses to the shared variables. And blocking, which we will talk about in detail.
So let's look at the merging arrays example.
The first…
I really like this example. With this, I finally understood why we prefer —
like, in computer science, when you first learn about classes and objects — why we need to
declare things together as an object or a class.
Look at this.
Just from here, can you see why?
From the cache's perspective.
What happened?
You don't have to decide.
This one — we make it a class, right? Put them together. And in the other, you
handle them separately. So this is the way I used to program, because I hated pointers —
for me, it was so hard to understand how pointers work, so I avoided that kind
of style, and I was okay, right?
I stayed up all night to get the results.
With this one, you will get results in one hour if you have a big array. With the other, you
need to wait a couple of days.
Why?
Will this method have to be combined with loop fusion? Because otherwise… No.
No, even without it — think about it. Okay, so this is actually a very good
example. You know the KV cache in LLMs?
So, to generate the next token, you do matrix computation. There are the K, Q,
and V matrices, and two of them are very critical.
But then, you know, they're too big, so they usually keep a separate matrix for each, okay? But
what if we put them together?
When we need the value for a key — can you see the nature of these two
parameters?
You need them all the time —
together, isn't it?
Okay, so then, what are the pros and cons? What's the problem with having a separate
array for each parameter?
So… which way do you program?
When you program.
Usually we do it this way, right? Because it's simple. And now you have a reason why you
learned about objects in class.
We put them together.
Whenever we use the i-th value, we need the i-th key together, right? So if you have
separate
arrays, what happens?
To bring a value, you bring a whole block of values, right? What if
the key's block has a conflict miss with the value's
block? Then you miss all the time.
Can you see that?
But when you put them together, it's always a hit, isn't it?
Okay.
In the first two lines, you
declare each variable separately. Okay, so then you will have two sequential arrays.
Whereas in the second, you use a structure: you define merge, with val and key back-to-
back together. So if you use this way of declaration, do you agree that val and key
with the same index will reside in memory back-to-back, consecutively?
So you can exploit the spatial locality, and we can reduce conflicts
between val and key. With the earlier one, if they happened to map to the same index,
then you have misses all the time.
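The conflict-thrashing argument can be demonstrated with a tiny direct-mapped cache simulator (all sizes illustrative, not the lecture's numbers): separate val/key arrays placed exactly a cache-size apart thrash on every access, while the merged struct layout takes only compulsory misses:

```python
BLOCK = 16   # bytes per cache block (illustrative)
SETS = 64    # direct-mapped: 64 blocks = a 1 KiB cache (illustrative)

def misses(addresses):
    """Count misses for an access sequence in a tiny direct-mapped cache."""
    cache, n = {}, 0
    for a in addresses:
        idx, tag = (a // BLOCK) % SETS, a // BLOCK // SETS
        if cache.get(idx) != tag:   # wrong tag (or empty): a miss
            n += 1
            cache[idx] = tag
    return n

N = 256
# Separate arrays: val[] at address 0, key[] at 1024. With a 1 KiB cache,
# val[i] and key[i] map to the same set and evict each other every time.
separate = [a for i in range(N) for a in (4 * i, 1024 + 4 * i)]
# Merged struct array: val and key of the same index share a block.
merged = [a for i in range(N) for a in (8 * i, 8 * i + 4)]
print(misses(separate), misses(merged))  # 512 128
```

512 misses (every single access) versus 128 compulsory misses — the worst-case placement the lecture warns about, versus the guaranteed-adjacent struct layout.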
Loop interchange example.
So, look at this. Just this…
So, if I give you this example during an interview, I am looking for one keyword. So
what do you need to come up with
to explain this phenomenon?
So, have you heard about… row major and column major?
So — is your memory two-dimensional, or one-dimensional, or three-dimensional,
or four-dimensional?
Do you remember the Turing machine example? We have a one-dimensional infinite
tape, right? It's one-dimensional. What do you have in this example?
Yes —
two-dimensional data, right? So we need to…
map the two-dimensional information onto one dimension.
So, there are two ways.
Row major means all the data of the same row goes together first.
Okay?
And then column major: column data first.
So, which one works better with what? Okay.
So, this one changes i first, whereas this one changes j first, right? So this one will work
well with what?
Column major — and this one will work well with row major. Okay, and modern
computers nowadays — most computers, I can say — are row major, so this one works
better.
Okay.
Actually, a long time ago — I think it was my first semester teaching this — one
of the graduate PhD students, working in bioinformatics, went back
to her code, looked at it, and changed some of the things she
learned from this class.
And then she said she used to wait 2 weeks to get one data point, and afterwards a couple of
hours to get the same.
Caches are amazing, okay? Because software people often aren't aware there are
caches, right?
Like, in our department —
Tim. Tim is very famous for MATLAB, right? Graphics applications. He came up
with some optimizations and code changes. Actually, a lot of his work is aware of the
cache, and he made some of those adjustments.
It's amazing how much better performance you can get at the software level.
I will give you some time to observe these two. We declare the X array; its elements are
consecutively located in memory.
In the first one, you are jumping around, every 100 words.
Whereas…
Okay, so in the final exam, I can give you this kind of code instead of a sequence of memory addresses. Do you remember? So far in this chapter, I gave all the sequences of memory addresses, right? What if I give this?
Okay, and then I give the starting address of the X array — let's say it's 16. Then you know this is int, and an int is 4 bytes, so what is the address of x[0][1]? You need to know it's row major, right? So, after [0][0], the next data will be [0][1]. Right? Then [0][2], okay? And then, if the array size is 3 by 3, after [0][0], [0][1], [0][2], it will go to [1][0], [1][1], [1][2], like that, right? You can come up with all the memory addresses, right?
Okay, I told you — in the final, instead of memory addresses directly, I can give you the code segment, so that you understand what's going on with this. Right? With this code, you will start with [0][0], and then it will go to [0][1], then [1][0], like that. Okay? Then, with the starting address and the size information, you know the address of each element, okay? Then you will have the same problem.
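The address arithmetic above can be sketched as a small helper; the base address 16, the 4-byte int, and the 3-by-3 shape come from the lecture's example (a minimal sketch, not the slide's code):

```c
#include <assert.h>

/* Row-major layout: x[i][j] lives at base + (i * ncols + j) * elem_size.
   elem_size is 4 here, since the example uses 4-byte ints.             */
unsigned addr_row_major(unsigned base, unsigned ncols,
                        unsigned i, unsigned j)
{
    return base + (i * ncols + j) * 4;
}
```

With base 16 and a 3-by-3 array, x[0][0] is at 16, x[0][1] at 20, and x[1][0] at 28 — exactly the walk described above.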
Since the "after" code accesses data in the order in which it is located, you improve spatial locality. This gives a huge improvement. The code change is a loop interchange: instead of striding through memory every 100 words, we have sequential accesses, which improves spatial locality. Of course, you will have a higher hit rate. That's how you can reduce the miss rate with loop interchange.
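A sketch of the interchange in C — the 100-column shape matches the lecture's "every 100 words" stride, while the row count is an arbitrary assumption:

```c
#include <assert.h>

#define ROWS 500
#define COLS 100

static int x[ROWS][COLS];

/* Before: the inner loop walks down a column, so consecutive accesses
   are COLS words apart in row-major memory -- poor spatial locality.  */
void scale_column_order(void)
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After interchange: the inner loop walks along a row, so accesses are
   sequential -- good spatial locality, same result.                   */
void scale_row_order(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```

Both versions compute the same values; only the memory access order, and hence the miss rate, differs.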
The third compiler optimization technique to reduce the miss rate is called loop fusion.
Okay, so before and after. I prefer "before". Is there any reason? If, during an interview, this is given, discuss these two different ways.
So why is "after" better? You should imagine N is very big, okay? The array is accessed in the same place by both loops; therefore, the second access should be a cache hit. So look at this a, right? You have a hit. A hit.
But without fusion there is no guarantee. But then why do people still code like "before"? Why is it convenient? We need to use some computer-scientist terminology, right? Modularity — you have these separate components, each one… Or, yeah, debugging, or readability, right? Actually, I think the number-one attribute you want your code to have is readability, isn't it? You code today, and tomorrow you don't know why you coded it this way, right?
So, actually, that's true. But the thing is, if you use the big -O option in GCC, it will do this for you, okay? All right, so you can still maintain readability, but your performance will be like the "after" version, okay? So let's move on.
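A minimal sketch of loop fusion in C — the array names and sizes are assumptions, not the slide's code. The "before" version streams a and c through the cache twice; the fused version reuses them while they are still resident:

```c
#include <assert.h>

#define N 1000

static double a[N], b[N], c[N], d[N];

/* Before: two separate loops. a[] and c[] are streamed through the
   cache twice; with a large N, the second loop may miss on data the
   first loop already brought in and then evicted.                    */
void before_fusion(void)
{
    for (int i = 0; i < N; i++) a[i] = b[i] * c[i];
    for (int i = 0; i < N; i++) d[i] = a[i] + c[i];
}

/* After fusion: one loop. a[i] and c[i] are reused while still
   resident, improving temporal locality. Same results.               */
void after_fusion(void)
{
    for (int i = 0; i < N; i++) {
        a[i] = b[i] * c[i];
        d[i] = a[i] + c[i];
    }
}
```

The unfused form keeps the two computations as separate readable units, which is exactly the readability-versus-locality trade-off discussed above.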
Loop fusion is easy, but this blocking — I really want you to pay attention. I won't give a final exam question on this, but these things appear a lot nowadays in our community. Because I talked about the KV cache, right? When an LLM generates a token, you need to have a cache, and the KV cache means matrix-matrix computation. Can you see that?
Okay? And after all, it will be computed in the computer, right — on a CPU or an accelerator — with cache storage, right? Your storage is always limited, so we try to maximize locality: the data you brought in, you use all the time, okay? We call it arithmetic intensity — we will talk about it later. Once you bring in a piece of data, if you make a smaller block, with blocking factor B it will be used at least B squared times; without blocking, you use it only once. Can you see that?
So, this sounds irrelevant, but if you understand this technique, you can understand most of tensor parallelism, accelerators, and GPU memory problems, because those are nothing but matrix computations, okay?
Here in this example, imagine n is a very, very big number. Then, if you look at the code, here is what happens in the two inner loops: you read all n-by-n elements of Z, you read the n elements of one row of Y repeatedly, and you write the n elements of one row of X.
This is exactly the pattern for machine learning. You can imagine Y is the input and Z is the weight values, isn't it? A weighted sum, to calculate the activation values of that layer. So what happens?
The number of capacity misses will be 2n³ + n², assuming there are no conflicts.
The idea for reducing the number of capacity misses: we can use a submatrix smaller than n by n, so that a B-by-B submatrix can fit into the cache and we will have fewer misses.
This figure shows a snapshot of the three arrays X, Y, and Z when n equals 6 and i equals 1. The dark shade indicates a recent access, the light shade indicates an older access, and white means not accessed yet. So, the elements of Y and Z are read repeatedly to calculate new elements of X. The number of capacity misses clearly depends on n and the size of the cache. If the cache can hold all three n-by-n matrices, then all is well, provided there are no cache conflicts.
Do you notice this Z? Because your accesses go down a column, every time you can assume it is a miss, isn't it? This Y goes through k, so there is spatial locality you can expect, right? But this — do you know this matrix computation? To get this element value, you need to multiply this row with that column, right?
And this one is accessed in column-major order, so while Y will be hit, hit, hit, hit, for Z there is danger of misses all the time. Okay? Every time you access the Z matrix, there can be a miss, okay? So that's very expensive — and then how many times will Z be accessed? n-cubed times — it's in the innermost loop, right?
Okay.
If the cache can hold one n-by-n matrix and one row of n, then at least the i-th row of Y and the array Z may stay in the cache. With anything less than that, misses may occur for both X and Z. In the worst case, there would be 2n³ + n² memory words accessed for n³ operations.
To ensure that the elements being accessed can fit in the cache, the original code is changed to compute on a submatrix of size B by B. The two inner loops now compute in steps of size B rather than the full length of X and Z. B is called the blocking factor.
With blocking factor B, the earlier code can be rewritten like this. And the figure also shows the case with B equals 3. As you can see, in contrast to the earlier figure, a smaller number of elements is being accessed.
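The blocked rewrite can be sketched as below; N = 12 and B = 3 are illustrative assumptions (the figure uses B = 3), and the naive loop is included for comparison:

```c
#include <assert.h>

#define N 12
#define B 3          /* blocking factor */

/* Before: x = x + y*z. The innermost loop streams a whole column of z
   for every (i, j), so each z element brought in is used only once.   */
void matmul_naive(double x[N][N], double y[N][N], double z[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] += r;
        }
}

/* After blocking: compute on B-by-B submatrices so one block of y and
   z fits in the cache; each z element fetched is reused B times.      */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N])
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

Both produce the same x; only the order of the memory accesses, and hence the working set in the cache, changes.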
So, by changing the code with a blocking factor, instead of handling the big matrix, you make, for example, this 3-by-3 block. Your cache is big enough to hold these together. Then, see, look at this: when you compute this, you need this row and that column — and you have this block of Z in the cache.
So, this is a very good idea and a widely used technique. And with similar things — if you look at computer architecture papers now, they do these kinds of things, and then exploit a lot of sparseness. There are a lot of zeros here, and how can we reduce the number of computations by exploiting the zero positions, or compress…
That was even several years ago, okay? I don't see exploiting sparseness for the computation much anymore, but then you can think about memory and compression — that kind of thing is coming.
Okay, so we are done with cache memory, and then let's go to virtual memory. We've talked about virtual memory before, so we can quickly go over the… PFR.
Okay.
We'll talk about prefetching also, and then this can be quick, and hopefully we can do both quizzes together. I already explained; I will go…
With this set of slides, I'm going to give you a review of virtual memory. The reading assignment for this slide set is section B.4 of Appendix B in our textbook.
What is virtual memory? In Wikipedia, you can find a definition: virtual memory is a memory management technique that provides an idealized abstraction of the storage resources that are actually available on a given machine, which creates the illusion to users of a very large main memory — bigger than the physical memory space — using secondary storage. Here, the key point you need to keep in mind: a processor runs multiple programs at the same time.
So let's begin with a story from the older times.
Computers run multiple processes, each with its own address space. It would be too expensive to dedicate a full address space's worth of memory to each process, especially because many processes use only a small part of their address space. Therefore, there must be a means of sharing a small amount of physical memory among many processes. One way to do this is virtual memory, okay? Divide the physical memory into blocks and allocate them to different processes.
Inherent in such an approach must be a protection scheme, okay, that restricts a process to only the blocks belonging to that process.
Most forms of virtual memory also reduce the time to start a program, because not all code and data need to be in physical memory before the program can begin.
Originally, programs shared one address space, which means the physical address space was shared. Machine-language programs had to be aware of the machine organization, and there was no way to prevent a program from accessing any machine resource.
Although the protection provided by virtual memory is essential for current computers, sharing is not the reason virtual memory was invented.
If a program became too large for physical memory, it was the programmer's job to make it fit. The programmer ensured that the program never tried to access more physical memory than was available in the computer, and that the proper overlay was loaded at the proper time. Virtual memory was invented to relieve programmers of this burden, okay? Programmers don't have to worry about it. It automatically manages the two levels of the memory hierarchy represented by main memory and secondary storage.
So, as long as it provides the mapping of virtual memory to physical memory, we can use secondary storage as well, giving the illusion of a bigger space.
So, with virtual memory, user programs run in a standard virtual address space, and then we need address translation. The address translation hardware, managed by the operating system — remember that? — converts a virtual address used by a user program on the CPU into a physical address: either the page is in memory, or, if it misses in memory, it is in secondary storage, the disk.
So, we need address translation hardware mapping virtual addresses to physical memory. With this hardware support, a modern operating system can provide protection, translation, and sharing.
This figure shows the mapping of virtual memory to physical memory for a program with four pages, A, B, C, and D. The actual location of three of the pages — A, B, and C — is in physical main memory, and the other, D, is located on the disk.
There are three advantages to using virtual memory. First, translation: a program can be given a consistent view of memory, even though physical memory is scrambled. So you can give a contiguous virtual memory space. Also, it makes multithreading very reasonable.
You learned this in the operating systems course, right? Okay. So, some of you, actually, when you pick your paper — if you pick a paper on the TLB (okay, we will talk about the TLB; it's a page-translation cache) or these virtual memory things, it's more operating-systems work. So you need to have a full-system simulator. It's different from, like, GSIM — GSIM has only the microarchitecture level, right? So, based on the topic you choose, the simulation tool you should use is different.
Only the most important part of the program — the so-called working set — needs to be in the physical memory. Contiguous structures, like stacks, use only as much physical memory as necessary, but can still grow later.
The second advantage is protection. Different threads or processes are protected from each other, so you cannot access memory belonging to other processes. Different pages can be given special behavior: you can make a page read-only, or make it invisible to user programs if it is kernel memory. Kernel data is protected from user programs. Very importantly, protection from malicious programs can be done through virtual memory.
The last advantage is sharing. You can map some common physical page to multiple users, so you don't have to keep redundant multiple copies.
In virtual memory, main memory — physical memory — acts as the cache of secondary storage. This is the system.
Then, we can revisit the four basic memory hierarchy questions. Do you remember? The first: where can a block — here, a page — be placed in main memory?
It will be fully associative, which means a page from disk can be placed anywhere
in main memory.
Do you remember any policy you learned in the operating systems course for when you allocate memory? You need to find space in the physical memory, right, when users need it. Like first fit — do you remember? Then, you know, Java has what? Garbage collection. Those are all memory management, right? They go through memory, and the chopped-up pieces they put together to make a big space.
Which means there is no direct association with the address. Remember block placement in a cache: we use the memory address, and at a certain point, certain bits tell where it should go. But from disk to physical memory, placement is fully associative — it can go anywhere. Then the policy maintained by the operating system uses that flexibility to make the best of it, okay? Right? From the hardware point of view, there are no limitations on placement, okay?
How is the list of free physical pages maintained? Is there a separate structure for that — the free list? We will see that.
Okay, so does the memory module maintain the free list, or does the OS maintain it? The OS needs to have a free list, right? Then whenever a new page is required by a process, it should give one, right?
The second question was: how is a page found if it is in main memory? This is called block identification. How do we do it? We will visit this in more detail later, but we're going to have a page table for the mapping.
The third question: which block should be replaced on a virtual memory miss? Okay, here, actually…
So, here, you can always connect this with the cache. Fully associative — what do we do in the cache? We go through the tags, right, because the block can be anywhere; you compare with the tags to see if there is a match. Can you see that? But remember, when we work on virtual memory, your "cache" is physical memory. It's big, right? And you handle it, you manage it, per page — 512 bytes, okay? Then maybe you could have a tag. What is the tag?
Okay.
So what is the tag? Try to understand this in a holistic way. Where was it? Okay, yeah.
So, do you remember when you have a cache block offset and a cache index, like that? Here, we have an address, and then what is this? The page offset — instead of a cache block offset, we handle it per page. And then the rest? The rest is the tag, right? Here we call it the page number. If it's a physical address, it's the physical page number. If it is virtual, what is it? The virtual page number, like that. Okay, can you see that? These are different. We would need to do tag matching, but we can't sweep the whole memory space, so we keep this mapping in table form, okay? Because we can put a page anywhere.
Okay, and the replacement policy — I don't know whether it's strict LRU or not.
I won't go through that again. So, replacement policy, okay, replacement policy. Again, we discussed how hard it is to maintain strict LRU order, right? A counter has a limited number of bits, and there is approximation and wraparound. So they try to provide an approximate policy: having two bits, okay — a use (reference) bit and a dirty bit. One is set when the page is referenced, the other when it is written or updated, and from the combination of these they decide which one should be kicked out. All zeros means it's never been used, right? But what if many of them have all zeros?
Right? So that's the operating system's domain. These bits are provided by hardware, but using them, the real replacement is done by the operating system. So, whenever we talk about virtual memory, it's the operating system's area — it's interdisciplinary work we are doing.
Okay? And then the write policy is about I/O to the disk, okay? So we have a dirty bit: we only update a page in the physical memory, and then when it swaps out, it will be written back to the disk, okay?
Questions?
Okay, so you said the reference bit for the virtual page is provided by hardware? Yeah, we provide that bit. Okay — we will see the structure later. And in the TLB, also, we use those two bits to provide a replacement policy. And when you said it is provided by the hardware, is there a register? No, the bit is a flag; you need just one D flip-flop.
So there's one D flip-flop for each virtual page in memory? No — physical page, right? Oh, one for each physical page. But then only when it is valid, right? If it is invalid, that means it's not used.
Doesn't the number of physical pages depend on the amount of RAM installed in the system? So if I install a 4 GB module and I add another one — how do we increase the number of D flip-flops to account for those pages? So, I don't think the CPU will dynamically put a flag there. In the DRAM design, per page, you would have this bit flag. You should have that bit.
Okay, and then the TLB, mainly — the TLB is also in table form, we will talk about it; it's a cache on the CPU side. Then there is the 2-bit information that is on the CPU side; you need to have a flag. I'm saying that this reference bit lives in, like, the DRAM module: if it is the page table, it's in the DRAM module, and the cache of the page table, called the TLB, is on the CPU side. You need to have that structure. So all these structures are hardware, provided by the architecture, but maintaining and using that information for those policies is done by the operating system. And they have a way to set these bits. It's interdisciplinary.
So, for each machine, it will be different which one changes the bit — the operating system or the hardware. Because in some operating systems, I've seen page tables maintained by the OS that have a reference bit. Yes — in memory, yes, yes. They use separate bit information, yes. Okay.
All right, so let's skip this page — no, okay, I need to talk about this. So this illustrates how the page table works.
The dirty block will be written back when the replacement takes place.
Page tables encode the virtual address space.
A virtual address space is divided into blocks of memory called pages, okay? So, in contrast to the cache, where we call it a cache block, here we call it a page.
In the previous slide, I explained that placement is fully associative, right? From the virtual address space, a page can be placed anywhere in the physical address space, so in order to find a physical page, we would have to go through the whole page space, right? If we followed cache-style tag matching, you would need to go through every tag, which is too big. So, we are going to have a table, okay? The page table has the mapping from virtual addresses to physical addresses.
So here — if it contains all the mappings, the page table becomes too big, so some machines support different page sizes, okay, to reduce the page table overhead.
A valid page table entry codes…
Where is the page table? The page table itself is in memory, because it's big, okay? It's slow. Think about it: every time we run the program, we need to go through translation, okay?
…the physical memory frame address for the page.
Note that the OS manages a page table for each process — each process has its own page table. So, when you use a virtual address, the virtual page number will be replaced with a frame number when we do the translation.
Let's look at the details of a page table, okay? The table contains the virtual page number and the frame number together. So, when a reference happens, a virtual address comes in. The virtual address is decomposed into a field for the virtual page number (the green letters) and the offset, 12 bits. From the 12 bits, we can figure out the page size, right? The page size is 2 to the 12 bytes here.
Okay?
So the last 12 bits are used as the offset inside the page, and the virtual page number is used as an index into the page table, so you find the PA here — PA means physical page number. And you see that this table will be…
…indexed by the virtual page number, and then you get the physical page number. Then the virtual page number is replaced with the physical page number; the offset remains as it is. That's the way the page table maps virtual page numbers to physical frames.
Then you treat this physical memory as a cache of the disk system.
As I explained before, for the replacement policy here we have a fully associative cache, right? Memory is fully associative — any page can go anywhere, and the mapping is provided by the page table. Then, when you bring a new page from the disk, which page frame are you going to use?
The operating system wants to have LRU, right? So here is how you do it: with a circular queue, okay? A circular queue has two pointers, head and tail, and a queue means first in, first out, right? So you know the oldest one will be at the head, and you will kick it out, okay?
So, in the page table, the dirty bit indicates the page has been updated, and the use bit is set to 1 on any reference. So, to mimic LRU — it's not perfect LRU — the operating system periodically clears the use bits and later records them, so it can determine which pages were touched during a particular time period. So, as the tail pointer moves up, you add a page after clearing its use bit in the page table, and whenever it reaches the head pointer, the page is placed on the free list if the use bit is still clear — you didn't use it for a long time, so it is marked as free. Pages with the dirty bit set are scheduled to be written to the disk when they reach the head position, okay? This is the way they provide pseudo-LRU with a circular queue.
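The circular sweep described above can be sketched as the classic second-chance (clock) scheme; the page count and the names here are assumptions, not the slide's exact structure:

```c
#include <assert.h>

#define NPAGES 8

/* One use (reference) bit per physical page, set by hardware on any
   access; the OS sweeps a "hand" around, clearing bits and giving
   each page a second chance before eviction -- a pseudo-LRU.         */
static int use_bit[NPAGES];
static int hand = 0;

void touch(int page) { use_bit[page] = 1; }   /* hardware sets the bit */

/* Pick a victim: advance the hand, clearing use bits, until a page
   whose bit is already 0 is found.                                    */
int clock_victim(void)
{
    for (;;) {
        if (use_bit[hand] == 0) {
            int victim = hand;
            hand = (hand + 1) % NPAGES;
            return victim;
        }
        use_bit[hand] = 0;            /* second chance */
        hand = (hand + 1) % NPAGES;
    }
}
```

A page that was referenced recently survives one full sweep; a page whose bit stays clear for a whole period is the one placed on the free list, as described above.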
Okay, so let's do the quiz — we have time. And this quiz is so easy.
Okay.
All right.
So, your virtual address is 17 bits — it's just a toy machine, right? It's not a real machine. And the page offset is 14 bits. What is the page size? 2 to the 14, so 16 kilobytes, right? Okay, then go to the next one.
Can you translate this? Mmm, this. It doesn't look like a table — I put a table, but… so let's say this is a table. Okay? A table. How many? Okay, so how many entries are there? 8, okay? So, can you translate this virtual address to a physical address? So easy, right?
So, let me show you. Here: 17 bits. You discard the last 14, and you have 3 bits, right? Okay, 3 bits? Let's go to the table. What are the 3 bits of this address? 100. So where is 100? It's an implicit table, right? It goes from 000, 001, and so on — this is it, right?
So you go to this entry, and it says 1, which means what? It's valid — this page is in the physical memory. If it were 0, it's a miss; it's in I/O, on the disk, right? So it's 1. Then what is the address? 11. So what is the translation? You copy the rest, okay?
So this, as it is, you copy: the 3-bit page number is replaced with 11, and the 14 offset bits stay as they are, okay? This is your answer.
Okay.
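The quiz translation can be sketched in C; the 17-bit virtual address and 14-bit offset come from the example above, while the table contents below are hypothetical stand-ins for the slide's table:

```c
#include <assert.h>

#define OFFSET_BITS 14            /* page size = 2^14 = 16 KB */

/* Hypothetical 8-entry page table: valid bit + physical frame number. */
struct pte { int valid; unsigned frame; };

/* Translate: the VPN indexes the table; if the entry is valid, the
   frame number replaces the VPN and the offset is copied through.     */
int translate(const struct pte table[], unsigned vaddr, unsigned *paddr)
{
    unsigned vpn    = vaddr >> OFFSET_BITS;
    unsigned offset = vaddr & ((1u << OFFSET_BITS) - 1);
    if (!table[vpn].valid)
        return 0;                 /* miss: the page is on disk */
    *paddr = (table[vpn].frame << OFFSET_BITS) | offset;
    return 1;
}
```

With entry 100 valid and mapped to frame 11 (binary), the 3-bit page number is swapped for 11 and the 14 offset bits pass through unchanged, just as worked out above.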
TLB — okay, we have one minute left. What do I do? Yeah, it's not hard, but it should be covered; it's important. Any questions on these problems?
What? In the main memory, do we leave some space, or… I mean, how much space in the main memory do we leave for those… We can calculate it, right? You can calculate it.
How do you calculate the page table requirement? Let's say, in the previous example, you have 17 and 14. How many entries do you need? 8 entries, right? And then each entry — you have 1 bit plus? 2 bits, right? 3. So you multiply 8 by 3; that's the total number of bits. That's the page table requirement, right? For a given address space, you can calculate the overhead of the page table.
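The counting above generalizes to a one-line formula; the 3 status bits (valid, dirty, reference) follow the lecture, and the frame-number width is a parameter you would fill in per machine:

```c
#include <assert.h>

/* Page table size in bits: one entry per virtual page, each holding
   valid + dirty + reference bits (3) plus the frame number.          */
unsigned long long page_table_bits(int vaddr_bits, int offset_bits,
                                   int frame_bits)
{
    unsigned long long entries = 1ULL << (vaddr_bits - offset_bits);
    return entries * (3 + frame_bits);
}
```

For the toy example (17-bit addresses, 14-bit offset), that's 8 entries; counting only the 3 status bits, as in the lecture, gives 8 × 3 = 24 bits, and any frame-number bits add on top of that.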
What about 2 to the 14? 2 to the 14 — that is the page size. The page size has nothing to do with the page table size. Yeah. Okay. Three. Yes. I don't get your question.
So, the page table should give the mapping information for all the cases: 3 bits can go from 000 to 111, and it lists out all the cases. But, okay, tell me — what would a real one be? Can you quickly search what a typical page table size is? A 32-bit system — and what is the page size? 4 kilobytes, so 12 bits, right? So you have 20 bits left. Can you see that? 20 bits. How many entries? 2 to the 20. Right? And then you multiply: for each entry, you need a valid bit, a dirty bit, a reference bit — 3 — plus? The frame number size, right? That's the basic size overhead, okay?
All right, so I will change the date of the next quiz. It could be delayed to next week. It shouldn't take a long time, okay? Thank you. We won't have a class on Friday, okay? Right.
It also says in yellow, like, the person…
Nov 10:
It's like… My friend, he sent me, like, 5 of you.
Okay.
Good afternoon, let's start. So, from this week, we are supposed to start the GPU — the next chapter — but let's finish up the memory hierarchy. The last topic is the TLB. What does TLB stand for? What is it? You were supposed to read it all, right? We're done with the reading assignment. What is a TLB?
Table… what is it? Translation? Look at some table… lookup buffer, right? Buffer. Yeah, they call it look-aside, but that doesn't mean anything to me either. I'm wondering why it's called look-aside instead of look-up. So what is the table here? What table are we dealing with here?
Give me a guess. Sure. What do you mean? So, TLB stands for Table Look… Buffer, right? So here, T stands for Table. What does translation look-aside stand for? Okay, let me check.
I see — it's the same. So, anyone volunteer to define what a TLB is, then? What is a TLB? Go ahead.
It's kind of like a cache that stores the mapping between the virtual page number and the physical page number.
Okay, so a cache of what? Where is the original information? What is it? Where is it? So, for the last quiz, what did you do? So what's the name of that mapping? What is it? Okay, you guys… page table!
See, in the final exam — after the memory hierarchy, you see a lot of terminology, right? And you cannot give a concise definition. See, this is a hint for your final exam questions. You really need to know what it is, right? So…
Are you fishing for "virtual address translation", or… So, yes, okay, okay — so how do we provide the translation? Through what? The page table. The page table, okay — that's why I said the T means table. Here, okay: so the TLB is a cache of the page table, right? So what is a page table, then? It's a mapping between virtual addresses and physical addresses. And a virtual address space is given to each process, okay? So that's how we provide protection, right? You don't need to memorize, right? Without knowing how it works — how it provides protection and sharing, and so on — it doesn't come through your heart, right? But once you understand: a process is a live program. Every process has its own page table, okay? Only visible to that process, okay? It's individual.
So, what is a typical size of a page table? You can calculate it and come up with it: the virtual address space, right? If it is 64 bits… right? And then, what is a typical page size? Page size? 4 kilobytes — 12 bits, right? Get rid of the last 12 bits. What remains is the virtual page number, and you need to provide the one-to-one mapping. Then, given the page size, you can quickly come up with the number of entries, right? It's huge, and a lookup is a memory access. So, we want to provide a cache.
Then — I was hoping you read this, because this is what I was supposed to finish last week and we couldn't — I asked a question in this slide set: why is it a cache? Why is it not a table? Okay, let's look at… Usually there's no prediction. There is no prediction.
So, do you recall the contrast between the branch prediction table and the branch target buffer, right? The BTB. Do you remember? The target buffer — we call it a buffer, but it was a cache. Why do we provide a table for branch prediction, versus a cache for the target address?
Why is it? Is it because we don't need one entry for every possible virtual address? Or it would be too expensive to store one entry for every virtual address? Usually we have a multi-level…
Okay, so what's the big difference between a table and a cache? With a table, you use a certain portion of the address and go to that row, right? But if it is a cache, what do you do? With the cache index, you go to a certain row, but then what do you do next? Tag matching. With a table, you don't need to do tag matching — why?
The branch prediction table — the first one you learned, like the 2-bit saturating counter — if the number of entries is, say, 8 (as in the exam), then you use the last 3 bits. You use it without tag matching, right?
The difference is that a cache holds a smaller copy of the actual data, so it needs tag matching. The actual data, right? And with a table, you don't do tag matching, which means you may use the same entry for different addresses, right?
Okay, so here — the TLB is also called a cache. So what I want here: in the branch prediction table, an entry can be trained by another branch that happens to have the same last 3 bits (if it is 8 entries, right?), but I can still use it. Right? Why? What's the penalty for a branch misprediction? If you predicted wrong, there are mechanisms to fix it. To fix it, right? So we take a risk: we let conflicts happen, right? Different branches share the same entry, but it will still be okay — it will be wrong a couple of times, but eventually it will learn from the current branch, right?
Whereas for the target buffer, what's the problem with having a wrong target address? You fetch and update along the wrong path, so there's a lot to undo. So you let the PC change to the wrong place, and then, when you realize — oh, it's not mine — you back up and squash; that action is so costly, right? So, same thing here: page table translation is a very important thing. You cannot afford to use wrong information.
And attackers — security attackers — usually try to do exactly that, okay? Mainly, they try to jump into random positions in memory where they are not supposed to jump, like the kernel area, okay?
So, I think that's all we need to know, and then let's go to the technical part, okay? Let me play those two. Sorry. This is all of it, so you need to listen. Oh, it's muted.
Audio shared by Kim, Eun J
Page tables are stored in main memory, and are sometimes paged themselves. Paging means that every memory access logically takes at least twice as long: one memory access to obtain the physical address, and a second access to get the data.
We can use locality to avoid the extra memory access. By keeping address translations in a special cache, a memory access rarely requires a second access to translate the address.
This special address translation cache is referred to as the translation lookaside buffer. The TLB is a small, fully associative cache of mappings from virtual to physical addresses.
So this is all you need to know for the next quiz, okay? It's nothing but a fully associative cache, okay?
And let me correct some information I gave wrong earlier. In the page table you have a valid bit, a dirty bit, and a reference bit, and in the TLB we also have dirty and reference bits. If your table is implemented as SRAM, those bits will be SRAM cells, not D flip-flops, okay? A D flip-flop is built from logic gates, and SRAM is a memory array; they are two different technologies, and you cannot mix them in one structure.
But those tables and the TLB: the hardware, the architecture, provides the structures, but the way the bits are set and refreshed is up to the operating system. So if in your research you want to work on TLBs or translation, those are interdisciplinary areas between operating systems and computer architecture, okay?
So, a TLB entry is like a cache entry, where the tag holds a portion of the virtual address and the data portion holds a physical page frame number, plus a protection field, a valid bit, and usually a use bit and a dirty bit.
To change the physical page frame number or the protection of an entry in the page table, the operating system must make sure the old entry is not in the TLB.
Let's recall how the page table works. Given a virtual address, you use the virtual page number as an index into the table. Note that the page table is huge, and each process has its own table, so it's big, okay? So whenever we need to access memory data, we need to access the page table, which is in memory again.
With the virtual page number as the index, you get the physical page number, and then you use the offset as it is, okay?
V = 0 means the page is not in main memory. In a cache, V = 0 means invalid, a cache miss. It's the same idea, but here it has another name: page fault. When a page fault happens, the operating system handles it, because bringing a page from disk takes a lot of time.
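As a sketch of that walk (a minimal one-level table; the offset width and table contents are invented for illustration, not taken from the slides):

```python
PAGE_OFFSET_BITS = 12  # assumed 4 KB pages for this sketch

def translate(vaddr, page_table):
    """Translate a virtual address via a simple one-level page table.
    page_table maps a virtual page number to (valid_bit, physical_page_number)."""
    vpn = vaddr >> PAGE_OFFSET_BITS                  # upper bits: virtual page number
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)   # offset is used as-is
    valid, ppn = page_table[vpn]
    if not valid:  # V = 0: the page is not in main memory
        raise RuntimeError("page fault: OS must bring the page in from disk")
    return (ppn << PAGE_OFFSET_BITS) | offset

# VPN 1 maps to physical page 5; VPN 0 is not resident.
pt = {0: (0, None), 1: (1, 5)}
print(hex(translate(0x1ABC, pt)))  # 0x5abc
```

Note how the offset bits pass through unchanged; only the page number is translated, which is what the TLB later caches.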
So here, in this process, when we have a TLB, it will work like this. The TLB caches page table entries as pairs, like you see here. So, a pair of…
Before looking at the TLB structure, let me add one more thing: page faults are handled by interrupts, okay?
So, what the interrupt means: your PC is walking through the program, and then when you have a load or a fetch instruction, you run into a page fault. What happens then is that the PC value is changed to the service routine, okay? And then.
This page transfer from I/O to memory takes a long time, so usually there is a DMA engine. We don't discuss it here, but you should know it. The direct memory access engine sits on the memory side, and what it does is this: the CPU just initiates the transaction: okay, I need a page. Then the page transfer, at the byte level or in small chunks, is done between I/O and memory by the DMA engine, so the CPU won't wait while the page transfer happens.
So the CPU orders the DMA engine to transfer a page, okay? And then the CPU goes to another context; that's why we do context switching all the time. Why do we call it an interrupt? When the DMA engine finishes the transfer and the page is in place, it rings the bell, okay? It lets the CPU know this is done. Then the CPU, because that process had a page fault, changes the PC to the interrupt service routine for the page fault, right? And meanwhile, until the page is in place, that process will be in the blocked state.
Okay, if you learned it in OS, the life cycle of a process moves between running and blocked. Whenever you wait on I/O (we call it I/O time because the data is in an I/O device), the CPU doesn't have anything to do for that process, so the process will be blocked; it will be queued in the blocked state. Then whenever the interrupt comes, meaning the data has arrived, the process is changed to the ready state, okay? It goes into the ready queue, and then the CPU, according to the scheduler, will pick up that process, and it goes back to the running state, okay? That's the big picture of how a job is scheduled in an operating system, okay?
So a page fault is a very expensive event, and a lot of the topics in handling page faults and the DMA are operating-system territory, okay?
Alright, so let's go to the TLB.
Like you see here: a pair, the frame number, which is the physical page number, together with its virtual page number, is stored in the TLB.
So the virtual page number acts like the tag. You have a tag and then the information, right? Do you remember? When we talked about caches, we had a tag and then the data, and we actually drew this vertically when I explained fully associative caches. Fully associative means there is no index. If we draw the table vertically, each entry looks like it has an index, right? No. Conceptually, these two columns are lined up in one row; there is no index, and the virtual page number acts like the tag.
So when you are looking for a translation, you go through each entry to see if there is a match. Then it returns the physical page number, and that page number is used to form the physical address, okay?
For reasons similar to those in the cache case, there is no need to include the 10 bits of the page offset in the TLB.
So here, you will get a quiz question on these TLB caches in the fully associative cache style.
So let's finish this one.
Who handles TLB misses? Actually, on some machines, like MIPS, the operating system, software, handles them; on some other machines, the TLB miss is handled purely by hardware. In the common case, the TLB is an interdisciplinary area, where the operating system and the architecture hardware handle it together.
Note that a normal cache access is only possible once we have the translated physical address, which means we need to go through the TLB access first, right? So can the TLB access and the cache access be overlapped?
This figure shows the idea. The virtual page number lines up with the tag field, so while the virtual page number goes through the TLB to get the physical page number, the index field is already available inside the page offset, right? The page offset is big enough to hold the index field, so before the translation is even done, you know which row of the cache to access.
And then the data will be ready, right? So by the time the TLB access is done and the physical page number is there, you can do the tag matching. If it is equal, a hit, you can use the data; if not, you handle the miss, right? So the two can be overlapped. Then what would be the downside?
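The overlap works purely because of where the bits sit. A minimal sketch with assumed sizes (4 KB pages, a 4 KB direct-mapped cache with 64-byte blocks; my numbers, not the figure's):

```python
PAGE_OFFSET_BITS = 12   # 4 KB page (assumed)
BLOCK_OFFSET_BITS = 6   # 64-byte blocks (assumed)
INDEX_BITS = 6          # 64 sets -> 64 * 64 B = 4 KB direct-mapped cache

def cache_fields(vaddr):
    """Split a virtual address; the index and block offset use only
    untranslated bits, so cache indexing can start before the TLB finishes."""
    block_off = vaddr & ((1 << BLOCK_OFFSET_BITS) - 1)
    index = (vaddr >> BLOCK_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    vpn = vaddr >> PAGE_OFFSET_BITS  # this part goes to the TLB in parallel
    return vpn, index, block_off

# index + block offset lie within bits [11:0], untouched by translation:
assert INDEX_BITS + BLOCK_OFFSET_BITS <= PAGE_OFFSET_BITS
print(cache_fields(0x34ABC))  # (52, 42, 60)
```

The assert is the whole trick: as long as index plus block offset fit inside the page offset, the set can be selected while translation is still in flight; only the tag compare waits for the physical page number.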
Then what would be the downside?
What is the downside? You have to increase the associativity sufficiently, and that makes the cache slower than you might otherwise want it to be.
So, actually, there is an answer for the downside. Okay, in this figure we don't have an associative cache. Look at this: when you have an associative cache, until the tag has matched, you don't know which data you are looking for. There is no parallelism; you wait until the tag matches, then you go through the mux and choose the data, right? Here, if it is direct-mapped, what happens? With the index, you already know the candidate, right? And with the byte offset you can prepare the data to supply while, in the meantime, you do the tag matching. Can you see that?
So if we have a strict mechanism like the one shown, what's the limit? Let's say the block size is fixed. Okay, then what happens? The cache size is also fixed, right? Because to fit the index inside the page offset, this first-level cache cannot be bigger than a page, can it? So you are stuck with a very, very small cache, okay?
So, to avoid that problem, we can make the cache associative, with the same index field, right? We can double the size: if it stayed direct-mapped, you would need one more index bit, right? But if you have a two-way cache, that extra bit is absorbed into the tag.
Do you know what I mean? So in the figure, say this index was 2 bits. You want to double the cache size, so the number of sets doubles, right? If it stayed direct-mapped, you would have 3 index bits here, but if you make it 2-way, you still have 2 bits; the extra one is hidden in the tag, right?
So those kinds of things are done. Actually, these are real things that happen. What most commercial machines do is this: otherwise, for every access we would need to go through the page table, okay? The page table walk takes so long, so we want to use a TLB; but the TLB can miss, and that also takes time. So what we usually do is keep the first-level cache virtually addressed: we still use the virtual address.
So the first-level cache is organized with the virtual address; we don't go through translation. Okay, and this is how we started: only this part, the page offset, doesn't need to go through translation, so we can use it to access the hardware right away, right? So that's very…
Yeah, security issues can happen there. "We're using the virtual and the physical address at the same level?" Nope.
So, you see, the cache is hardware, right? A physical thing. So to access the cache, the address you are dealing with should be a physical address. But a user program, when you compile it and then run it, has only virtual addresses. So for every address of a memory access, you need to go through translation.
Okay, so then: if the TLB hits, the TLB is small, so it fits into one CPU clock cycle; but if not, it takes multiple cycles, right? So that's expensive. For the first-level cache, we want to match the CPU speed, so we let it use the virtual address as it is.
A lot of the time, they do this with only the page-offset part, and still use the upper part as a tag. Okay. To provide the protection.
Here, actually, the size of the cache will be limited by the page size; you cannot make it bigger than that, right? That's an inflexible design, so we need to think about it the other way around.
So, the problem with overlapping the TLB access with the cache access: it only works as long as the address bits used to index into the cache are not changed by translation, okay? This usually limits things to small caches, large page sizes, or highly associative caches, so that you have a small cache index field.
So if you want a large cache, you can use an n-way set-associative cache. Example here: suppose everything is the same except that the cache is increased to 8 KB instead of 4 KB, okay? Then the bit colored blue here is changed by virtual address translation, right? One bit will be different, but it's needed before the cache lookup. So you have to wait, right?
Solutions for this: either you go to 8 KB page sizes, so you make the page bigger, or you go to a two-way set-associative cache, so this blue bit is included in the tag area. Or, software: guarantee that bit 13 of the virtual page number equals bit 13 of the physical page number. You can make sure it's that way, right? So there are different ways of handling this overlapping problem.
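The constraint being described can be written down directly: for the set index to fit inside the page offset, cache capacity is at most page size times associativity. A quick sketch with assumed numbers:

```python
def max_virtually_indexed_cache(page_size, associativity):
    """Largest cache whose set index fits inside the page offset:
    sets * block_size <= page_size, so capacity <= page_size * ways."""
    return page_size * associativity

# 4 KB pages: a direct-mapped cache is capped at 4 KB,
# but going two-way allows 8 KB with the same index bits.
print(max_virtually_indexed_cache(4096, 1))  # 4096
print(max_virtually_indexed_cache(4096, 2))  # 8192
```

This is exactly why the 8 KB example above needs either two ways, bigger pages, or an OS guarantee on the extra bit.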
Do you understand? We talked about making it two-way: you can double the cache size, but with two ways your cache index stays the same, right? So you don't have to grow this fixed field.
The other option: because the cache size is limited by the page size, you can double the page size.
The other one: you make sure the translation of this gray part, bit 13, stays the same. If the virtual bit was 0, then the physical bit is 0; translation is done by the operating system, so we can force it to do that, right? So we use the virtual address bit as if it were the physical address bit when we do the cache indexing. That's how it can be done. Okay?
Another way to overcome that problem is to allow cache access with the virtual address instead of the physical address, so you don't have to go through translation.
You can use the virtual address, especially for the level-1 cache, right? So while you're accessing the level-1 cache, you use the TLB, the translation lookaside buffer, to get the physical address; if the level-1 cache misses, you go to memory with the physical address. So you use the TLB result only on a level-1 cache miss, right?
The problem here is actually fatal, okay? It is the synonym problem: two different virtual addresses can share a physical frame, so the data may be in the cache twice, okay? In two different locations. Maintaining consistency is a nightmare this way.
Let me summarize what you just learned. "Virtual memory was originally invented as a…"
I'll skip this summary because we already covered it, okay? No, please, go ahead. "So if there's a TLB miss, then the OS comes into the picture, right? To fetch the page table entry from memory, and then…" Yes, yes, yeah.
So, we'll talk about it. Anyway, if it misses, you need to go to the page table. The page table is another memory access, since it is in memory; then you get the translation, and a replacement happens in the TLB.
Okay, so TLB replacement is usually done by the operating system, using the use bit and the reference bit. If both of them are 0, say, then that entry will be kicked out. They try to approximate LRU, okay, but it's a crude LRU, because we don't provide a counter saying when the last access was. So they use a combination of the use bit and the dirty bit together.
Very good question.
Alright, so this is an easy question. Let me make some space. Okay, this is it. So this is the question. Oh, I have a call. Alright. Can you read it and make a note, and then I can move on? This cannot fit on one monitor.
So this is the page table. The virtual address is 17 bits and the page offset is 14 bits, so you use the top 3 bits of the virtual address as the virtual page number, and then look it up in the page table, okay?
Then you have a two-entry TLB, okay? Two entries; you can draw a two-entry table, okay?
And these are the virtual addresses, accessed in sequence. First: can you show the final contents of the TLB? Okay, so it appears here.
So the first two, of course, will be misses, right, if the TLB is empty: 111 is not there, 100 is not there. Then 111 again: that's a hit. So easy, isn't it? Done?
So this is just so that, instead of me asking "what is a TLB, and what do you store in it," I can test it with this exercise, right? You show that you know what the tag is and what you are supposed to store in the TLB, right?
Let's say there are 2 entries, okay? I usually draw them side by side like this, because there is no index, right? It's fully associative. And the first access, what is it? Your virtual page number is 111, right? So you go to the page table: where is entry 111? Here it is, and its valid bit is 1, so the translation is there; it's not a page fault, and the translation is 01. So you record both of them: tag 111, and physical page number 01.
Okay. Keep going. Tell me what you have at the end. Both of them.
Keep going. Tell me what you have at the end.
Both of them.
Raise your hand if you're done. It's so quick, right?
Hone? Okay.
So, you'd go to the next one, 100, and of course it's not there, it's a miss,
right? This is a miss. Miss…
And 100… where is 100? 01101110. Okay, the translation is 11.
Okay? Then third one.
Your tag, Prime number is 111, right? There is a 111, it's a hit.
Okay?
The translation will be 0, 1, and then 00000, right?
So, how about the last one?
Is it 000?
Nothing mad.
Okay, so is LRU, we are using LRU, which one should be kicked out? The right one,
right? So this will be replaced to 000, and then you go 000, it is.
And the contents of TLB will be like this.
Okay?
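The exercise can be checked mechanically. Below is a tiny two-entry, fully associative TLB with LRU replacement; it mirrors the access sequence 111, 100, 111, 000, but the page-table contents are invented sample values, not the slide's:

```python
class TinyTLB:
    """Fully associative TLB: every entry's tag (virtual page number)
    is compared; on a miss the least recently used entry is replaced."""

    def __init__(self, n_entries=2):
        self.n = n_entries
        self.entries = []  # list of (vpn, ppn); front = most recently used

    def lookup(self, vpn, page_table):
        for i, (tag, ppn) in enumerate(self.entries):
            if tag == vpn:  # tag match -> hit
                self.entries.insert(0, self.entries.pop(i))  # refresh LRU order
                return "hit", ppn
        ppn = page_table[vpn]          # miss: walk the page table
        if len(self.entries) == self.n:
            self.entries.pop()         # evict the LRU entry (back of list)
        self.entries.insert(0, (vpn, ppn))
        return "miss", ppn

# Example page table (VPN -> PPN), values invented for illustration:
pt = {0b111: 0b01, 0b100: 0b11, 0b000: 0b10}
tlb = TinyTLB()
for vpn in (0b111, 0b100, 0b111, 0b000):
    print(bin(vpn), tlb.lookup(vpn, pt)[0])
```

Running it reproduces the miss, miss, hit, miss pattern from the lecture, and the final access evicts entry 100 (the LRU one, since 111 was just refreshed by the hit).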
Yes, thank you. Thank you. Okay. Oh, she just wrote the translation, so let me change the size. Okay. Everybody okay? Let me move on. Thank you.
Alright, so this time we finally start a new chapter, which you are supposed to read. And this should be fun, because nowadays NVIDIA does so well, you have high motivation, right? And they're recruiting a lot, aren't they?
Let's start with the intro. We won't jump into GPUs right away, okay? We will study SIMD in general, and the classic SIMD machine is the vector processor, because all those simple concepts are repeated in the SIMD extensions and in GPUs, okay? So we will go there.
So.
Let's begin with Chapter 4, Data-Level Parallelism in Vector, SIMD, and GPU Architectures. This set of slides first introduces these three concepts.
A question for the single-instruction, multiple-data (SIMD) architecture has always been just how wide a set of applications has significant data-level parallelism. Since Flynn first defined SIMD, the answer has grown: not only the matrix-oriented computations of scientific computing, but also media-oriented image and sound processing, and machine learning algorithms, which are now very popular.
Unlike MIMD (multiple instruction, multiple data) architectures, which need to fetch one instruction per data operation, a single-instruction, multiple-data architecture is potentially more energy efficient, since a single instruction can launch many data operations.
These two answers make SIMD attractive for personal mobile devices as well as for servers. Finally, perhaps the biggest advantage of SIMD compared to MIMD is that the programmer continues to think sequentially, yet achieves parallel speedup by having parallel data operations.
There are three variations of SIMD architecture: first, vector architectures; second, multimedia SIMD instruction set extensions; and last, graphics processing units, GPUs.
The first, the vector processor, which predates the other two by more than 30 years, extends pipelined execution to many data operations. These vector architectures are easier to understand and to compile to than the other SIMD variations; that's why we want to learn them first. They were long considered too expensive, though: part of that expense was in transistors, and part was in the cost of sufficient dynamic random access memory bandwidth, given the widespread reliance on caches to meet memory performance demands on conventional microprocessors.
The second variation is multimedia SIMD instruction set extensions. The name itself is borrowed from the SIMD name to mean basically simultaneous parallel data operations, and such extensions are now found in most instruction set architectures that support multimedia applications, for example the x86 architectures.
The SIMD instruction extensions started with MMX (multimedia extensions) in 1996, which were followed by several SSE (streaming SIMD extensions) versions in the next decade, and they continue to this day with AVX (advanced vector extensions).
To get the highest computation rate from an x86 computer, you often need to use these SIMD instructions, especially for floating-point programs.
To be honest, I have very low motivation to cover the SIMD extensions, the Intel ones. They are kind of a failure, right? Market-wise and technique-wise. They are so coupled to the original x86 format that they have real limitations in exploiting the massive parallelism you can achieve with multiple-data processing. But I will cover them, okay? Because you may interview with Intel, too. And I still hope Intel will survive, okay?
"I'm confused. What's the difference between a vector processor and a SIMD processor?"
So, I would say SIMD, when we say SIMD, is the more general term, okay? The theoretical characterization. Do you remember Flynn's four categories? SIMD is the category name. In that category, the original one, like the Cray system we learned about, starts with vectors: you have a vector processor, and in vector mode each instruction works on vectors. When you have a load, it's not a load per word; you load vector data. And when you add, you add two vectors together, okay? Then "SIMD extensions" is the name for the Intel extensions we just talked about. And then the GPUs. Those are the 3 big categories that belong to SIMD.
"So, vector instructions are a type of SIMD?" Yes, yes.
"And that probably means there are other types of SIMD instructions. What are they? GPUs?" And the Intel SIMD extensions, MMX; they are all SIMD, too.
"Doesn't the GPU also do that kind of bulk memory access, which is sort of the same concept, where you are fetching…" They are categorized as the same, but the details are different; we will see.
"When we say SIMD, are we referring only to fixed-width SIMD here, or also variable-width…" Oh yeah, we will talk about it. There will be both. Thanks.
"Is there a register that specifies what the width is?" Yeah, that's a configuration setting. That's RISC-V; the RISC-V vector extension is a different vector extension.
GPUs. The third variation of SIMD architecture, the GPU, comes from the graphics accelerator community, offering higher potential performance than is found in traditional multicore computers today.
Although GPUs share features with vector architectures, they have their own distinguishing characteristics, in part because of the ecosystem in which they evolved. This environment has a system processor and system memory in addition to the GPU and its graphics memory. In fact, to recognize those distinctions, the GPU community refers to this type of architecture as a heterogeneous architecture.
For problems with lots of data parallelism, all three SIMD variations share the advantage of being easier on the programmer than classic parallel MIMD programming. The goal of this chapter is for architects to understand why vector is more general than multimedia SIMD, as well as the similarities and differences between vector and GPU architectures.
Okay, so we will start with the vector architecture.
With this set of slides, we first discuss vector architecture, one of our SIMD architecture variations.
The basic idea of vector architecture is to gather sets of data elements scattered in memory, place them into large sequential register files, operate on the data in those register files, and then disperse the results back into memory.
A single instruction works on a vector of data, which results in dozens of register-register operations on independent data elements.
These large register files act as compiler-controlled buffers, both to hide memory latency and to leverage memory bandwidth. Because vector loads and stores are deeply pipelined, the program pays the long memory latency only once per vector load or store, versus once per element, therefore amortizing the latency over, for example, 32 elements. Indeed, vector programs strive to keep memory busy.
Let's begin with an example architecture, the RISC-V vector instruction set extension, which is called RV64V. Note that this is based on the Cray-1, which is a 40-year-old machine.
The first primary component of the RV64V instruction set architecture is the vector registers. Each vector register holds a single vector, and RV64V has 32 of them, each 64 bits wide.
The vector register file…
Sorry, there is a typo; it's 64 bits, okay? You have 64-bit registers, 32 of them, okay?
The vector register file needs to provide enough ports to feed all the vector functional units. These ports allow a high degree of overlap among vector operations to different vector registers.
The read and write ports, which total at least 16 read ports and 8 write ports, are connected to the functional unit inputs and outputs by a pair of crossbar switches. One way to increase the register file bandwidth is to compose it from multiple banks, which works well with relatively long vectors.
The second primary component of RV64V is the vector functional units. Each unit is fully pipelined and can start a new operation on every clock cycle. A control unit is needed to detect hazards: both structural hazards for the functional units and data hazards on register accesses.
The third component is the vector load/store unit. The vector memory unit loads or stores a vector to or from memory. The vector loads and stores are fully pipelined again, so that words can be moved between the vector registers and memory with a bandwidth of one word per clock cycle,
after an initial
latency.
The last primary component of vector architecture is a set of scalar registers. These are the normal 32 general-purpose registers and the 32 floating-point registers.
The best way to learn how a vector processor works is to look at a vector instruction example.
So here is a vector loop example, DAXPY: double-precision a times X plus Y, where X and Y are vectors.
Do you recall the scalar version of this code that we had before, right? You load X, and then Y, and then you multiply by a constant value, and then add to Y, right? So here, these are vectors.
We are actually handling the same code segment as for loop unrolling, Tomasulo's algorithm, and hardware speculation; and now for vector and GPU, you will see all the same thing, okay?
I strongly recommend that you compare these codes with the scalar code, which is on page 288, okay? So let's begin with the first one. You first enable four double-precision floating-point vector registers, and then FLD means you load scalar data, a, into F0. And VLD means you are loading a vector, okay? Vector X into a vector register, okay?
And VMUL, as you can see here, the multiplication has a scalar operand, right? So in VMUL the two operands are mixed: one is a vector, one is a scalar, so the suffix is VS. Okay? You multiply, and then you load another vector, Y, with VLD into another vector register, okay? And then you add vector to vector, so VADD has the extension VV. VV means two vector operands, so you should differentiate them: the multiplication needs one scalar operand, so it is VS. And SV means the first operand is a scalar and the other is a vector.
Okay? And then you will have VST, vector store, okay? So if you compare this code to the RISC-V general instructions, which are on page 288, you see here…
What's the big difference here? There is no loop. We don't have a loop. Can you see that?
In the scalar version, for every element you load, you calculate and store back, and then you go to the next element; you do many iterations, right? Here, we don't do that. We load in bulk at the vector level, we multiply at the vector level, we add vector to vector, and then we store back a vector. Okay, the number of instructions is much smaller, and we don't have any branch. That's a very, very important thing, okay? We don't have any branch, because we don't have a loop.
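As a rough analogy (not the RV64V code itself; the function names are mine), the scalar loop versus the "whole vector at a time" formulation of DAXPY can be sketched like this:

```python
# DAXPY: Y = a*X + Y, double precision in the original example.

def daxpy_scalar(a, x, y):
    """Scalar style: one element per iteration, with a loop and a branch
    taken on every element."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y

def daxpy_vector(a, x, y):
    """Vector style: conceptually one multiply and one add over whole
    vectors; no per-element branch is visible to the programmer
    (the hardware pipelines the elements)."""
    ax = [a * xi for xi in x]                      # like VMUL.VS: vector * scalar
    return [axi + yi for axi, yi in zip(ax, y)]    # like VADD.VV: vector + vector

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
print(daxpy_vector(2.0, x, y))  # [12.0, 24.0, 36.0]
```

Both compute the same result; the point of the vector form is that the loop control and its branch disappear from the instruction stream.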
"Can you go back to the previous slide?" Okay, I need to remember; okay, this was 6-18.
Oh my goodness, okay. "So, when you say the register file has 16 read ports, how many bits or words is each port reading per clock cycle?"
Hmm. From here we can tell: if you have a vector add, you read two operands and store back to another register, right? So you can have 8 concurrent additions, can't you? Because you have 16 read ports, you can read 16 pieces of data at the same time, if they don't have any data hazard, right? And then you use the 16 inputs to produce 8 different outputs on the 8 write ports. So you can have 8 different add operations at the same time.
"Each read port can read one vector register's worth of data from memory. And so, in this case, we can fill up 16 vector registers per clock cycle, assuming there's no latency?"
No, no, no. Look at this: for loads you need to look at the load/store unit. The question you are asking is related to loads, isn't it?
"Right, I guess I'm just confused about what read port means in this context."
Okay, so let me go back to the RISC-V architecture we discussed. Here: how many read ports do we provide there? (Where is the Zoom window? I disabled Zoom. You guys, close Zoom? No, here. Goodness. Alright. I was so surprised.)
Okay, so when we have a register file, usually at the interface you have two read ports: read address 1 and read address 2 going in, and then two data values coming out. Okay, this is the scalar case. And for writing, one write at a time: a write address and write data. So in one cycle, you can do two reads at the same time, and one write at the same time. Okay?
So does that answer your question? "Sort of. I guess I'll have to revisit those chapters again, because the details are a bit hazy."
Okay, so as long as these read addresses are different… Let me draw what it usually looks like. You have a register file; say each register is 64 bits, and say you have 16 of them, okay? Then one address comes in: for 16 registers you need 4 bits, right? Those 4 bits go into a 4-to-16 decoder.
So if the address is 0000, this first word line is asserted; if it's 0001, then the next one, and so on. You have a decoder, a 4-to-16 decoder, okay? Then only one of the word lines is selected by the load instruction, and those 64 bits are shipped out as the data.
That's the structure you have. A register file actually looks like this: you have two decoders when you allow two ports, the addresses are fed into the two different decoders, and you get the two data outputs. Okay?
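A toy model of that two-read-port, one-write-port register file, with the decoder's word-line selection modeled as simple indexing (the class and method names are mine, just a sketch of the structure being drawn on the board):

```python
class RegisterFile:
    """Toy scalar register file: 16 registers of 64-bit values,
    two read ports and one write port per cycle."""

    def __init__(self, n_regs=16):
        self.regs = [0] * n_regs  # the SRAM array; indexing plays the decoder's role

    def read(self, addr1, addr2):
        # Two read ports: both addresses are decoded in the same cycle.
        return self.regs[addr1], self.regs[addr2]

    def write(self, addr, value):
        # One write port: a single register is updated per cycle.
        self.regs[addr] = value & (2**64 - 1)  # keep values 64-bit

# add R1, R2, R3 style usage: read R2 and R3, add, write back to R1.
rf = RegisterFile()
rf.write(2, 5)
rf.write(3, 7)
a, b = rf.read(2, 3)
rf.write(1, a + b)
print(rf.read(1, 0))  # (12, 0)
```

The vector register file described in the lecture is the same idea scaled up: many more ports, and each "read" delivers a whole vector register rather than one word.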
So what happens is, let's say you have add R1, R2, R3, okay? For this instruction, the R2 and R3 addresses are fed into the register file. At the same time, you read the two data values coming out; they are carried into, say, the ALU, okay? In the ALU you select add, the output comes out, and that result comes back and is fed into the write port, along with R1 as the write address. That is a very simple datapath for a single scalar operation.
And how about the vector version? You have 16 read ports and 8 write ports, which means you can have add.VV, okay? The operands are replaced by vector data: vector 1, vector 2, vector 3. So you have, say, 8 vector reads at the same time, okay? Then you use them and write back. All of this happens not on one scalar datum; it happens as bulk data together, okay?
Okay. Yes, it was 6.15. 18. Thank you. All right.
RISC-V scalar instructions, which are on page 288. You see here, we have only 8
vector instructions. On the scalar side you have a loop, so at runtime it'll be 258
instructions to execute. So…
Also, on a conventional scalar RISC architecture, every
FADD has to wait until the multiplication is done.
Similarly, every store, FSD, also has to wait until the add is done. We talked
about this code, right? But on a vector processor, each vector instruction will
stall only for the first element, because it's fully
pipelined for each element; then the subsequent elements will flow smoothly down
the pipeline.
So, pipeline stalls are required only once per
vector instruction.
Rather than per element, okay? That's a huge advantage.
The execution time of a sequence of vector operations primarily depends on three
factors: one, the length of the operand vectors; two, structural hazards among the
operations; and three, the data dependences.
Given the vector length and the initiation rate, which is the rate at which a
vector unit consumes new operands and produces new results, we can compute the time
for a single vector instruction.
All modern vector computers have vector functional units with multiple parallel
pipelines, which we call lanes, that can produce two or more results per clock
cycle. But they may also have some functional units that are not fully pipelined.
For this RISC-V vector implementation, we assume one lane with an
initiation rate of one element per clock cycle for individual operations.
Therefore, the execution time in clock cycles for a single vector instruction is
approximately the vector length. This simplifies the discussion of vector execution
for now.
Okay, so what does that mean? Initiation: in your quiz and in
this exercise, let's say the add initially takes, say, 12 cycles of startup.
And your vector size is 64. Fully pipelined means: for the first element to
get the first result, it will be 12, plus 1.
The startup, plus 1.
It's pipelined. And then how about the next one? Fully pipelined means the
very next cycle you will be done with the second, then the third. So it will be
roughly 12 plus 64, right?
But usually the vector size is much bigger than the initial delay, so the
approximation you use is just 64, right? It's just one vector unit of time.
Okay? That's how we do the approximation. We will do the detailed calculation, but
when you look at the convoy and chime terminology, that is all about this
approximation.
Okay?
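As a quick sketch of that arithmetic, assuming the startup of 12 cycles and vector length 64 used above:

```python
# Chime approximation sketch: assumed startup latency 12, vector length 64.
startup = 12       # cycles until the first element's result appears
n = 64             # vector length

exact  = startup + n   # first result after the startup, then one per cycle
approx = n             # chime approximation: ignore the startup

print(exact, approx)   # 76 vs 64 -- close, because n is much bigger than startup
```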
So…
This slide says the functional unit consumes one element per clock cycle. Does it
mean it reads one operand per clock cycle, or writes one? No, no; take the add
example I gave you. You're done with the first element's addition after the startup
of 12 cycles, so on the
13th cycle you are done, and then it's one element per clock cycle after that: the
second, the third, like that, every cycle, because it's fully
pipelined.
It produces one result per cycle; that's what this is trying to
say. It's a throughput, right, like a pipeline.
We use the notion of a convoy, which is the set of vector instructions that could
potentially execute together.
The instructions in a convoy must not contain any structural hazards. If such
hazards were present,
the instructions would need to be serialized and initiated in different convoys.
One may think that, in addition to vector instruction sequences with structural
hazards, sequences with read-after-write dependence hazards should also be in
separate convoys.
However, chaining allows them to be in the same convoy,
because it allows a vector operation to start as soon as the individual elements of
its vector source operands become available.
The results from the first functional unit in the chain are forwarded to the second
functional unit. In practice, we often implement chaining by allowing the processor
to read and write a particular vector register at the same time.
Okay, so you will master this concept with an example. We start with the
convoy. A convoy is a set of instructions which can be executed in
parallel, okay? So we will start by putting together only instructions without a
structural hazard, okay?
How about data dependences? Let's say you have an add writing V0 from V2 and V3,
and then V0 is used as a source of a multiplication. Can they be in the
same convoy or not?
They can, if I say chaining is done. Chaining means that between two different
functional units you have a chain: when the first element is done, you deliver it
without waiting for the whole vector to finish. Do you know what I mean?
So let's say your add startup is 4, and your multiplication startup is 7. Think
about the first element. After 4 cycles, the add's first element is done, right?
It doesn't have to wait until 4 plus 64, until the last element is done. It will be
forwarded to the next
functional unit.
Then the next functional unit, the multiplication, can start right away, right? So
how long does it take to get the add and multiplication of the first element done?
Just 4 plus 7, and you're done, right? Can you see that?
Without waiting. If you don't chain,
then it will be 4 plus 64. You need to wait until everything's done. Then you can
move on to
the multiplication, right? So that's chaining: whether you have an additional
wire to feed the pipelined data to a different functional unit,
okay?
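A minimal sketch of that arithmetic, using the assumed startups from the example above (add 4, multiply 7) and vector length 64:

```python
# First-element latency with and without chaining.
# Assumed numbers from the lecture: add startup 4, multiply startup 7, n = 64.
add_startup, mul_startup, n = 4, 7, 64

# Without chaining: the multiply waits for the whole add vector to finish.
no_chain_first = (add_startup + n) + mul_startup   # 4 + 64 + 7

# With chaining: the add's first element is forwarded immediately,
# so the multiply starts right away on it.
chain_first = add_startup + mul_startup            # 4 + 7

print(no_chain_first, chain_first)                 # 75 vs 11
```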
Early implementations of chaining worked just like forwarding in the pipelined
architecture we learned, but this restricted the timing of the source and
destination instructions in the chain.
More recent designs use a more flexible form of chaining.
To turn convoys into execution time, we need a metric to estimate the
length of a convoy. It is called a chime, which is simply the unit of time taken to
execute one convoy.
Therefore, a vector sequence that consists of m convoys executes in m chimes.
For a vector length of n, on our simple RISC-V implementation this is approximately
m multiplied by n clock cycles.
So, as I told you earlier, the initial delay is small compared to the vector
length; let's say n is 64 here. Then when you come up with the scheduling
and count the chimes, that is, how many convoys, how many lines you
have, each line will take approximately
64 clock cycles.
Okay, so if it's 3 lines, how many do you have? 3 times 64, right? That's the
approximation we use.
Okay?
So the chime approximation ignores some processor-specific overheads, remember
that, many of which depend on the vector length. We will see more examples
later.
Let's revisit the concept of convoys and chimes with an example we saw.
So here, the first convoy starts with the first VLD instruction, and the vector
multiply, VMUL, is dependent on the first load, but chaining allows it to be in
the same convoy, so you can put them together. The second load instruction must be
in a separate convoy, because there is a structural hazard on the load/store unit.
The VADD is dependent on the second VLD, but it can again be in the same
convoy via chaining. Finally, the store, VST, has a structural hazard on the VLD
in the second convoy, so it must go in the third convoy. This analysis leads to
the layout of
the vector instructions into convoys.
This sequence requires 3 convoys. Because the sequence takes 3 chimes, and there
are 2 floating-point operations per result, the number of cycles per floating-point
operation is 1.5, ignoring any vector instruction issue overhead, okay?
Note that, although we allow the VLD and VMUL both to execute in the first
convoy, most vector machines will take two clock cycles to initiate the
instructions.
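That convoy partition can be sketched as a small calculation. The grouping below follows the VLD/VMUL/VLD/VADD/VST sequence just described, with chaining and a single load/store unit, and ignores startup and issue overhead as stated:

```python
# Convoy/chime accounting for the VLD -> VMUL -> VLD -> VADD -> VST example,
# assuming chaining and one load/store unit.
convoys = [
    ["VLD", "VMUL"],   # chained: the multiply follows the first load
    ["VLD", "VADD"],   # second load starts a new convoy (load/store unit busy)
    ["VST"],           # the store conflicts with the second load
]

chimes = len(convoys)                          # 3 chimes
flops_per_result = 2                           # one multiply + one add per element
cycles_per_flop = chimes / flops_per_result    # 1.5 cycles per FLOP

print(chimes, cycles_per_flop)
```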
This example…
Do you have any questions on these numbers?
Okay, first of all, look at this. Do you see the three lines? I will go back to
how these three lines came up, okay? Three lines. So in this example, the vector
length is 32.
So for each one, we approximately say 32 cycles are required, okay? So 32, 32,
32, and that's
96, okay? All right? How about this number? How did they get this number?
You have two floating-point operations, right? So a lot of the time, they look at
how many floating-point operations you perform, right? For each
floating-point operation, how many cycles are required?
So you have 3 chimes, but you have 2 floating-point operations; that's why each
one comes to 1.5.
Okay? Usually, I ask this one, okay, how many cycles are required?
you can give a rough estimation, but the quiz we will do together… oh, I have time.
We can do it together.
The chime approximation is reasonably accurate for long vectors, okay? For example,
for 32-element vectors, the time in chimes is 3, so the sequence would take about
32 multiplied by 3, or 96 clock cycles.
Okay, so…
We're assuming there's no latency here. So this is another way the quiz
question could be asked by me, which means there's no additional startup. They put
one like this in the final, right? I guess so.
I always think understanding the fundamentals first is important. This gives a good
understanding of
vector parallelism.
So…
So, actually, even if I give a frequency: unless I ask for your execution time in
seconds, you don't need to worry about it, right?
Okay, I just want to be specific, because a lot of students say, oh, it's in
megahertz, and I don't know what to do. Unless it asks how many seconds, right? If
the question is all about cycles, it doesn't matter; you don't have to convert the
frequency to a clock cycle time.
So the fixed vector length is 64, and this is the original code, okay?
And we don't need to worry about strip mining, like how you get the data,
how long it takes from memory; don't worry, okay? And the entire sequence
produces 64 results, okay? And the vector length is 64.
And for the initial latency, we follow these figures: 6 for add,
7 for multiplication, 20 for division, and 12 for load, okay? We will follow
these assumptions; it's not a real system. Okay, we don't have
chaining, which means if there is a dependence, the instructions should be in
different convoys.
Okay.
Right? And a single memory pipeline means you cannot have a load and a store at the
same time. They have a structural hazard, okay? And then how many chimes are
required? Can you come up with the convoys for this?
You need to go through each line, right? So you put the first load. Can the
multiplication be in the same convoy with the load? Because there's no chaining,
it can't, right? So that's this. How about the add?
There is… no dependence, right? Isn't it?
The multiply and the add don't have a dependence between each other… okay, can you
draw a dependence graph? That would be the easiest way, right?
Okay.
Isn't it… So the first convoy will be LV.
And second, the MUL and the ADD can be in the same convoy, isn't it?
How about the third one?
The stores?
They cannot be in the same convoy. Why?
These two stores cannot be in the same convoy. Why?
Because of? Okay, for a convoy, you only look for two hazards. What are they? One,
we talked about the data hazard. The other?
Structural hazard. They use the same structure, which means that you cannot do them
in parallel, right?
You have only one load/store unit. It says…
where is it? I do remember: single memory pipeline. So you have
only a single memory port, one load or one store at a time, okay? They cannot be
parallelized. So which store will you put first?
Store V2, or store V4?
I will put store V4 first. Why?
Why?
Any guess? Because add instructions take fewer clock cycles than multiplication
instructions? Right: V2 actually comes from the multiplication, which takes longer.
They are parallelized; they start at the same time, but most
likely the add will finish first, so I will put the add's store first.
Okay?
Alright.
That's it, right?
So, how many chimes is this? 1, 2, 3, 4, and each one is 64, so it's 4
multiplied by 64. That's the approximation.
Okay.
But… I want to get how many clock cycles in detail.
Why?
Why do we do that? It gives more understanding of how the vector architecture
works, okay?
So, I usually draw it this way.
Load… I wish I had more space here, but then…
Okay, the load initially will take… 12, okay?
And then, after that, you have… 64 cycles,
one for each element; all 64 will be done, isn't it? That is the first convoy.
How about the second?
You don't chain, right? So you need to wait until this is done, until the first
convoy is done, so the multiplication starts from here.
How long is its startup?
Can you?
7, yeah, okay.
So let's say it's 7.
And then 64.
Next to it.
But these are together, meaning you have two pieces of hardware. The loaded data
will be fed into two functional units, so they start at the same time.
So, how long does the add take?
6, and then it'll be… 64, okay?
Okay?
Then how about the third line?
I will take this one.
So from here, when you're done here.
And then it's 12, right? 12, and then… 64.
And then for V2, you have the multiplication's store. The multiplication is done
here.
Twelve, I'm sorry.
Twelve, and then… 64.
So, the total is what?
This plus this plus this, right? That is the cycle count.
Okay?
So, can you try parts 2 and 3, and we can go over them together next time?
Okay?
Don't go. You just stay and finish. How about that?
We still have the empty classroom, so you can talk with your friends, or… Good try.
Yay.
Don't leave the room unless you understand what a chain, a chime, and a convoy are,
okay? Because this is a final question! One of our final exam questions!
Hey, thank you.
That was the last…
I'm understanding.
I think it's just… Definitely not.
But the answer to a question does not specify… She will now reopen.
Right. So you're saying… You guys were talking about the instructions, so you
decided to group together in the… Oh, initially.
So these are the initial things. Okay, oh, I'm… I haven't started number two. Oh,
this is number one?
Yeah, he's on, that's it.
I think it's supposed to be pipelined; that is the latency.
Fantastic.
So it's the latency plus one cycle per element.
Actually, without chaining, you can't even do that; you have to wait.
And you need to wait until the last element's done, so that's 64, or 63.
Every element takes the same.
So it takes 12 plus 64, you see.
Thank you.
Thanks so much for these.
Is it that after 12 cycles the first value is ready, or that after 12 cycles the
first value starts? After 12 cycles it's ready, yeah.
So, what's… that means that… But there's, there's…
What advantage are we getting?
Why can't we implement the same thing on a scalar processor? Like, are we doing
that kind of thing?
Then you can calculate them, and then you can write them back.
So, okay. So we can expect, like, there will be a benefit in the
execution time?
If the nature of your application is that it handles a lot of data; if it's SIMD in
nature.
We can assume that it can give us, let's say, a benefit between the lower and the
upper bounds.
Nov 12:
They're excited.
face it, you don't… It's so fun. I have no love. Yeah.
We are infinite and depth.
Yeah, so it's, like, 3 years of… to actually get them to agree.
With this set of slides, we first discussed vector architecture.
It was a lot of life.
That's quite accurate, so…
In the middle of a creek.
Oh my god.
Sorry, maybe I have to cancel the class and postpone it to…
Friday. Today I have a deadline for the proposal, and the
university has a concern about the contract form. I was so upset, because the
deadline is right there. And they will go home, because they won't work after
hours.
And there's a meeting at three, and then I have a class.
Let's see how it goes!
Because, you know, nowadays it's so hard to get
federal funding, so I'm trying to get funding from a company. And a company
contract is very different from a federal agency's. They want to get everything,
and whatever.
I…
My work, everything, is publishable.
So I'm fine, as long as I can support my students, but then… let's see. I told
them to call me, but see, they don't call me.
I'm the only one worried about this, and when I get the funding, they take 50% of
it. So, as a graduate student, you should know how the university works,
and, you know, when you prepare interview questions, you need to know what the
overhead is,
how much the university takes, right? Anyway, let's go. Last time I
made a mistake, so let me correct that first, okay?
Thank you for pointing that out. So, let's go over the first solution.
The first one here, with no chaining and a single memory pipeline: how many
chimes? Does everybody agree on
this? Okay, so, what were the convoys? We have…
So, the load has to be on its own.
Without chaining, you cannot put together instructions having data hazards, right?
So, the multiplication
and the add: they don't have any data hazard between them, and they don't have
structural hazards either,
so they can be together.
But not with the load, because the load needs to produce the results, so they have
a data hazard. And then one store, and the other store.
This is what we did, right? But the only thing I had wrong: this is, let's say, 12,
and then…
64. As soon as that is done, it will start
the multiplication, 7, and the addition, 6, and then 64.
And this one will also be 64, okay?
Then, as soon as one is done, the memory can be used, so this will be…
12, and then 64.
The memory has only a single port, so you cannot have the two stores concurrent, so
you should have, at the end, after this, another 12 plus 64. So this is the total
delay. To show the total delay,
you need to add these together. Usually what I do: these two startups are 12 and
12, and among the multiply and add
I take the 6, because 6 is the shorter path to the first store, and then 12…
Well, those are the initial delays, and then you have the 64s; how many? 1, 2, 3,
4.
Okay? So you can see how many chimes this is: you have 4, so 4 multiplied by the
element count. That is the rough estimation of how long it takes. But including the
initialization delays, you can get a more detailed clock cycle count for this.
Any questions?
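Following the accounting just described for this first case (no chaining, single memory pipeline; startups of 12 for load and store, 7 for multiply, 6 for add; vector length 64), the rough and detailed counts can be sketched as:

```python
# Case 1: no chaining, single memory pipeline, vector length 64.
# Convoys: [LV], [MUL, ADD], [SV of the add result], [SV of the mul result].
n = 64

# One charged startup per convoy. The add's 6 is charged for convoy 2 because
# the store placed next only needs the add's result -- the small one-cycle
# optimization from class (store V4 before V2).
startups = [12, 6, 12, 12]

rough = len(startups) * n                       # chime approximation: 4 x 64
detailed = sum(startups) + len(startups) * n    # 42 + 256

print(rough, detailed)                          # 256, 298
```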
So you asked about why we… do you see these are switched? The
store of V4 comes before V2, because V4 is
produced by the add, which can be done earlier, one cycle earlier. That's why I did
a small optimization, okay? It only makes one cycle of difference.
For your final, I
wouldn't be that scrutinizing, but this is the rough idea. Without chaining, you
do it this way.
Question? Is the switch done in the compiler, or is it… Yeah, the switch is in the
compiler. So we are acting like a compiler, right?
Okay, so let's go to the next one. Can I erase this?
So, any volunteer for the next one?
Maybe we want to have them all together so that we can compare.
So, the next one is chained. How is it different? Chained means the instructions
having a data-hazard dependence can be together. So, which ones?
So the first convoy: you have the load vector, and then, as soon as one element is
done, it can go to the multiplication and the add; it broadcasts to the two, right?
Okay?
And then you cannot have the stores together, because of the
structural hazard. So, with chaining, you only care about structural hazards,
right? So, since they have only a single
memory pipeline, it ends up like this, okay?
Okay.
So then, what is the delay for this one?
So… you have 12.
Let's say in 12 cycles your first load element is done, and then it's chained,
so the 7
and 6 startups begin together, and then the longest part is this 64, right?
Between convoys we don't do chaining, so the second one
should wait until the whole first convoy's operation finishes. So let's take this
one again, okay? We do the same optimization for this one: we store V4
first, and V2 later. So this can start at
this time: 12, and then 64, then…
Rough estimation.
If you're very picky about what this initialization means: let's say here, it's
12, and we take how long the first element takes to come out, right?
Then the rest will actually be 63 instead of 64, but we do a rough estimation.
Okay, so the total time will be 12 plus…
6 plus 12 plus 12, and then convoys 1, 2, 3, so the 64 counts 3 times. Can you see
that? The first problem was 64
times 4, but this one is 64 multiplied by 3, because the number of chimes,
the chimes, is 3, right?
That's it? And is there also chaining between the…? Only within a convoy.
Oh.
Within a convoy we do chaining, and across different convoys we don't do
chaining. We need to wait until everything is done and written to
the vector register.
Perfect.
Okay?
Okay?
Gotcha. Alright.
So then, how about the third one?
Okay.
Chaining means once the first element is done, the next operation can start; but,
as she said, chaining is only within a single convoy. So you do
chaining, and you have 3 memory pipelines, which means…
you can have up to 3 loads and stores at a time, right?
Then all of this can be done in one convoy, isn't it?
Bye.
Yeah, one, yeah. So you have chaining, so as soon as the load is
done with the first value, you send it to these two, and then when each of these is
done with its first element, it will send the data to the store; that's it, right?
So it all happens in parallel. So…
Should I write down the convoy? The convoy is everything on one line, right? So
let's look at the timeline.
So compare with the second one, okay? It'll all be the same as one
big convoy.
Here?
Oh, for the second problem: why is it that for the first problem we had the load as
its own convoy, but with chaining, the multiplication and add are in the same
convoy as the load, and the stores are not? The load and store cannot be in the
same convoy because of the structural hazard.
So when we do chaining:
instructions having a data hazard, the read-after-write dependence, can be in
the same convoy, because we have a forwarding wire.
We don't wait until all 64 results come out. But between convoys, we must wait.
So, there's one memory port to do loads or stores? Yes.
Should there be one more cycle between the 12 and this… after the load, before
the
addition?
So… I just take it this way: after 12, it can be forwarded to the
multiplication and the add unit. What exactly is the initialization time? Yes,
so, yes, if you want to be really exact,
if you do a
thorough analysis, there are two ways to interpret what the initialization is.
The way we did it earlier: we count all 12, and after that, you have results
coming out starting the next cycle. So it'll be 12 plus 64, okay?
So here, we have 12 initial cycles, and then on the next cycle the first
element's result comes out, right? And what I assume is that you forward it in the
same cycle, so the next unit starts in the same cycle, okay?
But then the next cycle will follow, okay? If we had three memory ports,
could we have a single convoy for this? Yes, that's the answer. That's why I don't
even have to write down the convoys, because it's all in one line, okay?
But then the timing analysis will be different. How? Tell me.
Because of chaining: first the load, and then… what is it? The multiplication.
The multiplication is 7,
and the add is 6, started at the same time. As soon as the first multiplication
result is done, what do you do? You
start
the store, which is 12.
Okay? And as soon as the first element of the addition is done, you can also start
its store.
Okay?
Isn't it?
All right? So these startups all happen in sequence, okay?
And then each of them has a 64. Which one is the longest?
This one; I'm not good at drawing, but this is the longest one, right? So you go
backward along it, so I will write it down backward: 64, and then 12, 7, 12, okay?
This is the total clock cycle count: 64 plus 12 plus 7 plus 12.
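All three cases can be compared in one sketch, using the lecture's assumed startups (load/store 12, multiply 7, add 6) and vector length 64. This follows the class accounting, not a cycle-accurate simulation:

```python
# Comparison of the three configurations, vector length 64.
n = 64

# 1) No chaining, one memory pipeline: 4 convoys, one charged startup each
#    (load 12; add 6, the shorter path to the first store; two stores, 12 each).
t1 = (12 + 6 + 12 + 12) + 4 * n

# 2) Chaining, one memory pipeline: 3 convoys; convoy 1 chains load -> add (12+6),
#    then the two stores in their own convoys (12 each).
t2 = (12 + 6 + 12 + 12) + 3 * n

# 3) Chaining, three memory pipelines: everything in one convoy;
#    the longest chain is load -> multiply -> store (12 + 7 + 12), plus one 64.
t3 = (12 + 7 + 12) + n

print(t1, t2, t3)   # 298, 234, 95
```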
All the, yeah, all the additional options?
Okay, so, let's understand what a vector processor is, right? On a vector
processor, let's say you have…
let's take one instruction. Say a store: store vector, the address is
in RB, and the whole vector, 64 elements, is in V2, okay?
What you're doing is, you start to store these 64 data items at that time.
Without anything else, you handle 64 data items at a time.
So, what it means: if instructions belong to different convoys, you don't do any
chaining. Chaining means that, even across different instructions, we kind of treat
the vector data as
individual elements: as one data item comes in, I start the addition on it. But we
don't do that here. When we write down this code, this is the contract, right?
When you have a store vector, you have an address in RB, V2 is the 64-element
data you have, and you will store all the vector data to that location.
So, when you do the store in a separate convoy, you assume V2 has all 64
results ready to be stored.
Okay, you cannot overlap the execution between convoys.
Okay?
Alright.
This is the easiest to learn, right?
Okay.
Is it possible to know for that?
tables?
Do you want? You wanna turn off?
Okay.
All right. I'm not functioning well today. My whole mind is on the proposal. I'm so
upset about it, and he hasn't replied yet.
Okay, they said that they're gonna have a meeting at 4, so I will finish this class
sharply at 4, okay?
Is it possible to overlap any of the initializations?
The initializations; okay, so I would look at it this way. This is all
pipelined, okay? Pipelined.
If you learned how to do addition, there is a hardware structure, right? It is
pipelined, and it takes 6 cycles to compute one element.
Yeah, yeah. So while one element goes through the 6 cycles, the next
element will be marching right behind. It's all pipelined.
Can you imagine? So the 6 cycles won't be used exclusively by one element. The six
cycles are pipelined, so you can think of there being six
stages,
with one element in each stage, and the consecutive
elements marching together. So when the first element is done, after
6 cycles, right after it the second element comes out, then the third. That's
because of pipelining.
Does that answer your question? For… I guess I'm thinking, like, the load takes 12
to start, and then on the 13th cycle, that first element is… out. Yeah, out. We're
assuming that. And then on the 14th cycle, the second element comes out. Yeah, it's
all pipelined.
Okay, so we can't overlap until the first element is… Yeah, yes, yes. The first
element comes out of the load after 12; then it can go to the adder for its 6
cycles.
Okay?
Okay.
Okay.
And so, are you done with this example?
Someone asked about curving.
Yes, I do curve. I do curve. So, the average is 60, right?
And usually before, the midterm average was around 75, and this semester it's a
little bit lower.
So, historically, for the students around the average:
if you do the other things well, because the other things are the easier ones,
right? You have homework and quizzes; as long as you do them, you have the credit.
And the term project, if you work on it diligently, and…
There is some scale difference between the top team and the bottom team, but
it won't be as harsh as the midterm and the final.
So… yeah, you have hope for an A, but if you're far below the average, an A is
hard.
Yeah, I think… yeah, hmm… B or C.
So it's hard to predict now, and if students improve their score a lot on the
final,
I really take that into consideration, okay?
Sometimes I've seen the top scorers from the midterm screw up the final, and they
get a B, and they get so upset; why? But then, you know, it's…
the final. The final is final, right?
So, do you have any other questions? I cannot give you an exact number now, but
usually I look at the distribution and then try to find the place to cut.
Okay.
I know, yeah, this time the midterm was a little bit lower, so I…
I don't know. But it says 60, so… Thanks.
My number is blank.
When is the deadline for the Q-drop?
Is it past?
You still have time, right?
I don't know… if your goal is getting a B… it's not guaranteed.
All right, so this is the basic way vector architecture works. Now let's look at
what kinds of optimizations are done to improve its performance. A lot of these
optimizations reappear in
GPU architectures and the SIMD extensions of Intel machines. So it is good to
understand them on the easier, simpler architecture.
With this set of slides, we're going to discuss 7 optimization techniques for the
basic vector architecture we learned.
So far, we learned the basics of vector architecture. Let's look at 7 optimizations
that either improve the performance or increase the types of programs that can run
well on vector instructions. In particular, we will answer this question first: how
can a vector processor execute a single vector faster than one element per clock
cycle?
Multiple elements per clock cycle can improve performance.
Second question: how does a vector processor handle programs where the vector
lengths are not the same as the maximum vector length (MVL)? Because most
application vectors don't match the architectural vector length, we need an
efficient solution to this common case.
The third question is: what happens when there is an if statement inside the code
to be vectorized? More code can be vectorized if we can efficiently handle
conditional statements.
Next.
This question of how to handle branches is very, very important in any SIMD
architecture, okay? So you need to
understand how it works in vector architecture, and then how GPUs handle it,
okay? A big portion of GPU architecture is about how they handle
the if statement.
Because when you have a big loop, say over an image,
1024 by 1024, or even, you know,
machine learning: you have simple multiplications, the same
operation over and over, so it's easy to parallelize, right? You do exactly the
same thing. But if in the middle there is: if this value is not equal to zero, do
this; if not, do that; there is an
if statement, then we cannot parallelize easily, right? Because the execution
times are different, and they have a divergence of
control flow, right?
So that's a very important problem. And the other one, vector lengths that are not
64: we already discussed this idea when we talked about loop unrolling. So, let's
say you do loop unrolling 3 times, and your number of iterations is, like…
what do you do then?
It's not quite a multiple.
16. Okay, so you unroll 3 times, and 16 is not a multiple of 3, right?
What do you do?
You run the unrolled-by-3 version 5 times, right; you have a bigger loop.
So you run the unrolled-by-3 version 5 times, and then you have a
cleanup code for the remainder, right? You have a quotient and a remainder.
Exactly the same thing we are doing here, okay? So whether the length is bigger or
smaller, you can handle it. It's just common sense.
Okay?
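The quotient-and-remainder idea can be sketched as follows; `process` is a stand-in for the real loop body, and the numbers (unroll by 3, 16 iterations) follow the example above:

```python
# Unroll-and-cleanup sketch: 16 iterations, unrolled by 3.
# divmod gives quotient 5 and remainder 1: run the unrolled body 5 times,
# then a cleanup loop handles the leftover iteration. Strip mining uses the
# same trick when n doesn't match the maximum vector length.

def process(i, out):
    out.append(i)          # stand-in for the real loop body

n, unroll = 16, 3
out = []
q, r = divmod(n, unroll)   # q = 5 full unrolled passes, r = 1 leftover

for pass_ in range(q):     # unrolled-by-3 main loop
    base = pass_ * unroll
    process(base, out)
    process(base + 1, out)
    process(base + 2, out)

for i in range(q * unroll, n):   # cleanup code for the remainder
    process(i, out)

print(len(out))            # all 16 iterations covered
```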
And the memory is important, let's see here.
What does a vector processor need from the memory system? Without sufficient memory
bandwidth, vector execution can be limited.
So this, also, you can connect to the current trend. Why Hynix, Korea Hynix stock
is skyrocketing, right? Why Samsung had so much trouble when they gave up HBM, they
had a wrong decision at the time, and then they regret, and then the people in the
line of decision all fired.
Okay. It's a… it happened real time, actually.
games.
Why? Because whenever we have a vector SIMD architecture, memory is very
important. You need to sustain high bandwidth.
Right? You have a lot of computation power, but what about memory? You need to
ship the data in quickly. So HBM, you learned that, right? HBM was, in a way,
proposed to improve memory
bandwidth.
So these are kind of very classic problems, right? Vector architecture has
been around so long, but now that we are in the machine learning era, all
these things are, you know, coming back.
Audio shared by Kim, Eun J
How does a vector processor handle multidimensional matrices?
This popular data structure must vectorize for a vector architecture to do
well.
Next, how does a vector processor handle sparse matrix?
Kim, Eun J:
This is the one, okay. When AI accelerator hardware design first came up, it
was already several years ago, like 10 years ago. I was late…
To be honest, anyway, I don't want to go into my philosophical speech.
When everybody was jumping into, you know, machine learning applications and
what kind of hardware we should design, the first set of proposals was all
about sparseness.
Like convolution, CNN, DNN: if you look at convolution,
you have a huge image coming in, but only a few important edges exist in that
big image. Everything else is mostly zeros, and in the weight values there are
lots of zeros too.
But then we store all the zeros in memory, right? And we do, like, multiply
and add with the zeros.
It's all unnecessary computation going on. So everybody worked on it at the
time. Nowadays, you know, the big names in AI system design, those people
started their careers on AI hardware design with this sparseness.
Okay? They wanted to compact the model, compact the weight values and the
computation, okay? So all of this is coming up again, okay? That's why I
think…
Learning the fundamental ideas, the fundamental concepts, is very important,
because these come back, okay?
Audio shared by Kim, Eun J
Okay.
Sparse matrices are popular, so we need a data structure for them. The last
question is, how do you program a vector computer? Architectural innovations
that are mismatched to programming languages and their compilers may not get
widespread use.
Kim, Eun J:
So, you took a data structures course, right?
So, when we have a matrix, what data structure do you use? Like, let's say, a
2D matrix.
You have an array, right?
But the array is not good for a sparse matrix, right?
So what did you learn?
What do you do?
Like, 1K by 1K elements,
and there are only 10% non-zero values.
Okay.
So, will you still use a 1K by 1K array?
You do what? Yeah, like, you can go through the column, and then an index, and
then the non-zero value, a linked list, something like that, right? You need
to have a linked list, because you don't know how many values are non-zero,
right? Then there are, like,
other issues coming. What is the problem of a linked list compared to the
array?
All right. Memory alignment? No. Recall the chapter we just finished.
Cache.
Can't do prefetching? That is close, right? Prefetching won't work well. Why?
Because… so, okay, an array means what?
Contiguous, so the cache works well, right? Can you see that?
Even if they're all zero, once you access element 0, elements 1 and 2 are
already in the cache.
Right? But when you change to the linked list, a linked list is dynamic, isn't
it? So you use the heap area, and the nodes are not contiguous.
Right?
Then, whenever you load the data,
there is no spatial locality. You may have a high miss rate.
Okay, so all those things were, you know, addressed in early accelerator
designs, and they are still important.
People are still working on that, okay?
So not everything is new, okay? So…
Data structures: you may think it's a boring class, but it's very important.
People come up with, for big graphs and big matrix computations, new kinds of
data structures so that we can…
Improve the locality of memory accesses, and your speedup will be humongous,
okay? It's not that difficult.
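The column/index/value idea described above is essentially compressed sparse
storage. Here is a minimal CSR (compressed sparse row) sketch; the struct and
function names are illustrative, not from the lecture, and a real library
would add error handling.

```c
#include <stddef.h>

/* Minimal CSR sketch: instead of a dense 1K-by-1K array that is 90% zeros,
 * store only the nonzeros plus two index arrays. */
typedef struct {
    size_t nrows;
    const size_t *row_start; /* nrows+1 entries: nonzeros of row r are
                                vals[row_start[r] .. row_start[r+1]-1] */
    const size_t *col;       /* column index of each nonzero */
    const double *vals;      /* the nonzero values, stored row by row */
} csr_t;

/* y = A * x for a CSR matrix: only the nonzero entries are ever touched,
 * skipping all the multiply-adds with zero. */
void csr_matvec(const csr_t *A, const double *x, double *y) {
    for (size_t r = 0; r < A->nrows; r++) {
        double s = 0.0;
        for (size_t k = A->row_start[r]; k < A->row_start[r + 1]; k++)
            s += A->vals[k] * x[A->col[k]];
        y[r] = s;
    }
}
```

Note the trade-off the lecture raises: the dense array has perfect spatial
locality, while CSR's indirect access `x[A->col[k]]` can miss in the cache,
which is exactly why sparse formats and memory hierarchy interact.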
It's not that difficult.
So, I know a lot of you chose memory topics, like cache…
replacement policies and such, and when you do your work, yes, you can do the
minimum to, you know, earn the points for the final term project, but
if I were you, I would try to see how this replacement idea works with modern
applications, like AI applications, which have
a lot of sparseness. There are benchmarks, okay, that you can try.
Audio shared by Kim, Eun J
To achieve higher throughput, beyond one element per clock cycle, we use
multiple lanes. A critical advantage of a vector instruction set is that it
allows software to pass a large amount of parallel work to hardware using…
Kim, Eun J:
So, multiple lanes. This belongs to the first one. We talked about 7
optimizations in an earlier slide, right? This is the first one. What was the
first one?
Can we handle more than one element per cycle? This is the one we've discussed
so far. You have an add, and then you feed the data.
Right? And then the addition takes 6 cycles, it's all pipelined, so first add,
second add, you can do them one by one. But what if you have multiple adders?
You can have multiple lanes. The data can be, you know, spread out. And
actually, this is the way GPUs are designed.
When they have one SM, they have 16 threads going. Usually these registers
are, you know, partitioned 16 ways, and for your
functional units, you have at least 16 of them, so all 16 can be fed at the
same time. This is how we can handle more than one element per cycle.
Audio shared by Kim, Eun J
Only a single short instruction. One vector instruction can encode scores of
independent operations, yet be encoded in the same number of bits as a
conventional scalar instruction.
The parallel semantics of a vector instruction allow an implementation to
execute these elemental operations using a deeply pipelined functional unit.
The left two pictures show how to improve vector performance by using parallel
pipelines to execute a vector add instruction.
The RISC-V vector instruction set has the property that all vector arithmetic
instructions only allow element n of one vector register to take part in
operations with element n from other vector registers.
This dramatically simplifies the design of a highly parallel vector unit,
which can be structured as multiple parallel lanes. As with a traffic highway,
we can increase the peak throughput of a vector unit by adding more lanes.
The right-most picture shows the structure of a four-lane vector unit.
Therefore, going from 1 lane to four lanes reduces the number of clocks for a
chime from 32 to 8.
Kim, Eun J:
So for this chapter, the way I would prepare for the final: for each topic,
let's say multiple lanes, I would use my own words and try to write down what
it does.
And then, what it helps with, right? You want more than one element per clock
cycle. So try to understand the key idea, because as you can see here, can I
make an example? Can I give a quiz on this?
A lot of concepts are covered in this chapter and the next chapter.
I cannot come up with examples for all of them, so I cannot always check
whether you understand a concept with an example I could walk through, though
I prefer that way. You already know, right?
Tomasulo and hardware speculation: I never ask what they are. I just give you
an example and say, show me how it works, right?
But for these things I can't do that, so maybe I will check whether you
understand them conceptually, okay?
So for this chapter, when you do the reading assignment, I would take every
keyword and try to summarize it in your own way, okay? And check with the
slides and the textbook whether your understanding is correct.
Audio shared by Kim, Eun J
For multiple lanes to be an advantage, both applications and architecture must
support long vectors. Otherwise, we don't see much of an advantage.
To handle loops whose length is not equal to 32, we use the vector length
register. A vector register processor has…
Kim, Eun J:
Okay, so the vector length register. Why does it exist?
What's the answer? It's about the second optimization. If you have the slides,
you can go back: it is how to handle
more or fewer than 64 elements, because your vector size defaults to 64,
right? Okay?
That's what this is about.
Audio shared by Kim, Eun J
It has a natural vector length determined by the maximum vector length, which
is MVL. This length, which was 32 in the previous example, is unlikely to
match the real vector length in a program. Moreover, in a real program, the
length of a particular vector operation is often unknown at compile time.
So let's look at this example. The size of all the vector operations depends
on n, which may not even be known until runtime.
The value of n might also be a parameter of a procedure containing the
preceding loop,
and therefore subject to change during execution.
The solution to this problem is to add a vector length register, VL, here.
The VL controls the length of any vector operation, including a vector load or
store.
The value in VL, however, cannot be greater than the maximum vector length,
MVL. This solves our problem as long as the real length is less than or equal
to MVL.
This parameter means the length of vector registers can grow in later computer
generations without changing the instruction set.
What if the value of n is not known at compile time, and thus may be greater
than MVL?
To tackle this problem, where the vector is longer than the maximum length, a
technique called strip mining is traditionally used.
Strip mining is the generation of code such that…
Kim, Eun J:
I would write down strip mining. Okay, it just explained the concept of strip
mining, right? Strip mining is used when?
When the vector size is unknown at compile time, and it could be bigger,
right? Okay, so…
So, the…
The final won't be comprehensive, because I didn't return your midterm paper.
Usually, if I return your midterm paper, one or two questions from the midterm
will appear in the final, but we won't do that, so your final will cover only
material after the midterm.
So, memory hierarchy, okay?
So you will have more time,
because I cannot put in a lot of cache questions.
Audio shared by Kim, Eun J
…that each vector operation is done for a size less than or equal to MVL.
One loop handles any number of iterations that is a multiple of MVL, and
another loop handles any remaining iterations, which must be fewer than MVL.
RISC-V has a better solution than a separate loop for strip mining.
The instruction setvl writes the smaller of MVL and the loop variable n into
VL. If the number of iterations remaining in the loop is larger than MVL, the
loop can compute MVL values at a time, so setvl sets VL to MVL.
If n is smaller than MVL, it should compute only on the last n elements in
this final iteration of the loop.
So you can look at this code in detail.
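The strip-mined loop just described can be sketched in plain C. This is a
scalar model, not actual RISC-V vector code: the `vl = min(MVL, remaining)`
line plays the role of the setvl instruction, and the final short strip
handles the remainder. `MVL` and `daxpy_stripmined` are assumed names.

```c
#include <stddef.h>

#define MVL 64   /* assumed maximum vector length */

/* Strip-mined y[i] += a * x[i] over n elements of unknown size: each trip of
 * the outer loop is one vector operation of length vl <= MVL. */
void daxpy_stripmined(size_t n, double a, const double *x, double *y) {
    size_t low = 0;
    while (low < n) {
        size_t vl = (n - low < MVL) ? n - low : MVL;  /* what setvl computes */
        for (size_t i = 0; i < vl; i++)               /* one vector op of length vl */
            y[low + i] += a * x[low + i];
        low += vl;                                    /* advance to the next strip */
    }
}
```

For n = 130 and MVL = 64, the loop runs strips of 64, 64, and 2 elements, so
the code works no matter what n turns out to be at runtime.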
To handle if statements in loops…
Kim, Eun J:
Okay, so the third optimization: what if you have an if statement, okay?
A lot of times, when we do vector data handling, the if statement is kind of
simple, like this example.
Audio shared by Kim, Eun J
Vector mask registers will be used.
The presence of conditionals (if statements) inside loops and the use of
sparse matrices are two main reasons for lower levels of vectorization.
Programs that contain if statements in loops cannot be run in vector mode
using the techniques we have discussed so far, because if statements introduce
control dependences into a loop.
Likewise, we cannot implement sparse matrices efficiently using any of the
capabilities we have discussed so far. So let's look at how to handle
conditionals.
In this example, the loop cannot normally be vectorized because of the
conditional execution of the body. However, if the inner loop could be run
just for the iterations where X[i] is not equal to 0, then the subtraction
could be vectorized.
The common extension for this capability is vector mask control.
The RISC-V vector predicate register, which is P0 here,
holds the mask and essentially provides conditional execution of each element
operation in a vector instruction. These registers use a Boolean vector to
control the execution of a vector instruction, just as conditionally executed
instructions use a Boolean condition to determine whether to execute a scalar
instruction.
When the predicate register, in this example P0, is set, all following vector
instructions operate only on the vector elements whose corresponding entries
in the predicate register are 1.
The entries in the destination vector register that correspond to a 0 in the
mask register are unaffected by the vector operation.
Kim, Eun J:
Okay. Let me repeat what it does. Okay, did you understand how it works?
So, you can think of it this way.
Now, we have abundant computation units.
Okay, so the subtract we will do for every element.
Okay, so you have 64; this will be parallelized across 64. We have 64
subtractors, so it takes only one cycle, doesn't it? Right? Then the results,
when you store to this X[i], we store selectively,
only when this is not equal to zero. You use a mask register, which here you
can see as P0. P0 is a mask, so you mask out, all right?
Why not avoid running the computation if we know we will ignore the result?
So, think about it.
You have 64 computation units,
and you use the subtract selectively,
checking the value. So there is control-flow…
divergence. So at the end, for example,
if it's zero, you skip. You skip the line. If it's not zero, you do this,
right?
So in one thread… so you start all 64 threads at the same time. You have code
for each element. And then one goes through the computation, the other
doesn't. The time they finish
is a mismatch. Isn't that like the convoy thing? If you just wait till the
convoy is done, then it would finish earlier. Like, if you had a double
divide, with a huge number of pipeline stages, you could just skip the
instructions that you don't need in your… Oh, okay, okay. So, so,
But now, we have multiple lanes.
Multi-lane. Okay, so let's take an extreme case.
We have 64 subtractors.
And for each element you are using one thread. This is actually how a GPU
works, okay?
And I didn't show the rest of the code; you have the rest of the code marching
together. So the threads, like 64 elements, do a computation, and when they
are done, you move to the next
computation. If they march together, it's easy to control.
But some of them, because of the zeros, finish earlier. Anyway, they will need
to wait until the others are done.
Right. Okay, so the point is that in SIMD, having divergence in the execution
control flow is a nightmare.
Okay, so vector architectures and GPUs all try to avoid that. And even
Intel…
Itanium, IA-64, has something similar to this. You know, every time we have a
branch, we have performance degradation, right? So they use predication.
They try to have one instruction, with no deviation. Deviation means: if true,
you go here, if not, you go there. But if you make it one instruction, it
happens within one instruction; in terms of control flow, you still have
sequential execution of each instruction.
That's why they provide a conditional move. If you learned Intel x86, Intel
has conditional moves, the CMOVcc instructions. They have a conditional move
that copies the data only when the condition is true.
That can be done in one instruction, so you get rid of the if statement. Very
similar. So you want to have a more holistic view, okay? Even in CPU design,
people try to avoid having branches, because a branch costs, you know… we
learned a lot, right? Branch prediction: if it is wrong, you need to roll
back, and that really kills performance.
So vectors have a mask, and GPUs, when we get to GPUs, have a similar
mechanism. They have mask flags. So, computation-wise, it's as if there is no
if statement; all 64 elements will be
ready.
They do the subtract.
And then when they save, okay, the subtract is done for everybody, but when
they save, they save only where P0
is set, okay? They mask out the rest.
Okay?
This is the technique SIMD architectures use to handle branches.
You predicate, and that determines which element operations are disabled. Is
it from the sub instruction?
So, okay, all of these are, like, preparing P0, okay? And then, when you have
the store vector, they look at this
mask register,
and only the elements where it is set are saved.
Okay.
So this is a very common technique, and very popular, okay? So you can see
that whether X[i] equals 0 or not, the number of instructions you go through
is exactly the same, isn't it?
So every thread finishes at the same time, synchronously.
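The compute-everything-then-mask pattern above can be written as scalar C.
This is an if-conversion sketch of my own, not RISC-V vector code: the
subtract is done unconditionally for every element, like the full-width vector
subtract, and a predicate bit playing the role of P0 decides which results are
written back.

```c
#include <stddef.h>

/* Masked x[i] = x[i] - y[i], applied only where x[i] != 0.
 * Every element runs the same instructions, so there is no control-flow
 * divergence: the mask only gates the write-back. */
void masked_sub(size_t n, double *x, const double *y) {
    for (size_t i = 0; i < n; i++) {
        double r = x[i] - y[i];   /* computed for all elements, even masked-off ones */
        int p = (x[i] != 0.0);    /* predicate bit, like an entry of P0 */
        x[i] = p ? r : x[i];      /* write back only where the mask is set */
    }
}
```

Note the trade-off the lecture describes: the masked-off subtractions are
wasted work and power, but every "thread" takes the same path and finishes in
lockstep.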
So, similarly, recently… I told you our group's paper got accepted at HPCA.
What we found: when you have distributed training, you can think of having
several workers, and with partial data you train, and then you do a reduction.
But when you exchange data, we can compress the
data. But when we do compression, due to the data differences, the compression
ratios are different. So we work like four people working together: once I
send the data to her, she sends hers, like that. It is synchronous, but when
we compress,
my data can arrive very quickly, whereas another's takes long; there are
variations.
And that doesn't help improve performance, because you want to have the same
size, right? Originally it's the same size to begin with. So we want to reduce
it, but synchronously. So, you know, my PhD student, your TA,
actually, is trying to come up with a splitting algorithm, so that…
Based on history (in architecture, we always use history data), she came up
with a nice way to cut the data into chunks that end up with a similar
compression ratio and size, so…
Similar thing, though. Whenever we do pipelining or parallelization, these are
important. We want to march together, okay? So make sure you understand this
idea:
why we do the vector mask register masking out.
You already know this is wasting your… what?
Resources, right?
Actually, it is wasting computation time. It consumes more power, right?
Having useless computation.
Right? But still, the gain is more, because everything happens at the same
time. You don't have any variation; if one thread finished early, it would
still be holding its PE, its processing unit, and it would keep
stalling, right, until the longer ones finish, and then you move on together.
So.
Actually, there is a slight overhead of more energy consumption, but still,
it's better than spinning, waiting for the other, longer threads to finish.
Because they cannot move on… You know, asynchronous
execution is not allowed in SIMD. We always treat the threads as a block. We
will talk about it later.
Okay.
Did you understand this? So, for your information, maybe you want to search
for CMOV, the conditional move in x86. Intel does that.
Okay, if it is a simple condition, the compiler will try to use a conditional
move, and this is the same idea:
on an Intel machine, it will do the computation, but at the end it will have a
CMOV. It will move the output to the destination only when, you know, the
condition is true.
Audio shared by Kim, Eun J
However, using a vector mask register does have overhead. With scalar
architectures, conditionally executed instructions still require execution
time when the condition is not satisfied.
However, the elimination of a branch and the associated control dependences
can make a conditional instruction faster even if it sometimes does useless
work.
Similarly, vector instructions executed with a vector mask still take the same
execution time, even for the elements where the mask is zero.
Despite a significant number of zeros in the mask, using vector mask control
may still be significantly faster than using scalar mode.
To supply high bandwidth for vector load-store units,
we use a multi-bank memory system.
The behavior of a load-store vector unit is significantly more complicated
than that of an arithmetic functional unit. The start-up time for a load is
the time to get the first word from memory into a register. If the rest of the
vector can be supplied without stalling,
then the vector initiation rate is equal to the rate at which new words are
fetched or stored. Unlike simpler functional units, the initiation rate may
not necessarily be one clock cycle, because memory bank stalls can reduce
effective throughput.
Typically, the start-up penalties for load-store units are higher than those
for other units, like 100 clock cycles for most architectures. Here, we assume
the start-up time
for RISC-V vector is 12 cycles, the same as the Cray-1. To maintain an
initiation rate of one word fetched or stored per clock cycle, the memory
system must be capable of producing or accepting this much data.
Spreading accesses across multiple independent memory banks usually delivers
the desired rate.
Mostly…
Kim, Eun J:
Do you understand what a multi-bank memory is?
Nowadays, even the DDR you purchase is multi-banked.
You don't have single-bank memory, okay?
So, does using multibanking require that the memory accesses are all
sequential? No, but it works really well if
the accesses are sequential, because the banks are interleaved, right? You
have the first block in the first bank, then the second, like that. You do
interleaving, right?
So, if the accesses are more sequential, you will exploit more parallelism
between the different banks.
And we'll talk about it.
We call it bank conflict.
Okay, that is coming.
Audio shared by Kim, Eun J
Most vector processors use multiple memory banks, which allow several
independent accesses rather than simple memory interleaving, for three
reasons. First,
many vector computers support multiple loads or stores per clock cycle, and
the memory bank cycle time is usually several times larger than the processor
cycle time. To support simultaneous accesses from multiple loads or stores,
the memory system needs multiple banks and needs to be able to control the
addresses to the banks independently.
Second, most vector processors support the ability to load or store data words
that are not sequential. In such cases, independent bank addressing, rather
than interleaving, is required.
Lastly, most vector computers support multiple processors sharing the same
memory system, so each processor will be generating its own separate stream of
addresses.
For all these reasons, a large number of independent memory banks is
desirable. Let's look at an example.
Kim, Eun J:
So, before we go to the example, let's talk a little bit about loads or stores
of non-sequential words, okay?
Well, when we have AI applications, there are big inputs and big weight
matrices, so the accesses are mostly sequential, okay?
So, still, when we have a multi-bank memory, it's mostly interleaved at the
block level, okay? There has been some research, okay, on independent bank
addressing using prime numbers;
a Chinese mathematician came up with the prime number scheme, so that, you
know, you can have a different way of mapping to the different memory banks to
reduce the number of conflicts.
However, those are, you know,
mathematically correct; I used to cover that all the time, but I don't do it
anymore, because it really depends on the reference patterns, right? Those
patterns are all dynamic, and they are not predictable, okay? A mathematical
approach works well when we have a large
number of requests. So, in the first year of my PhD, because I love math, and
I noticed that in the architecture community we don't have much in the way of
formulas and mathematical analysis, and then I, like…
We only compare experimental results, right? Why? I was really curious. And
then, think about it: for example, in telecommunications, when we analyze the
number of calls and conflicts, we handle many calls.
There are many, many calls, so we can apply mathematical properties,
memorylessness, things like that, and model it beautifully, right? However,
when we design a CPU and the like, do we have that many independent
instructions? No. How can I say it?
It would be good to use a mathematical model for estimation, because you
abstract away a lot of complicated behavior, but the system is integrated,
very, you know, interactive with itself.
But…
That's what I concluded, okay? I was very curious why these people never
attempted to use a mathematical model to, you know, come up with the execution
time of a CPU, whatever, but
we don't have many, many instructions which are independent of each other.
There is so much dependence that it's hard to capture those behaviors in a
mathematical form.
So we use cycle-level simulators, which you do. This is our culture, okay? In
the architecture community, we believe in numbers.
As long as you use a well-known simulator, people will buy your experimental
results. Even you: for your term project, you use the simulator, and if you
find, oh, you changed something and it's much better than this year's MICRO
paper, you should tell me, and we need to write a paper, okay, right away.
So that's a culture thing.
All right, so let's look at some rough mathematical estimation for this
example.
Audio shared by Kim, Eun J
You have 32 processors, each capable of generating 4 loads and 2 stores per
clock cycle, and the processor clock cycle is 2.167 nanoseconds. Okay.
Here, calculate the minimum number of memory banks required to allow all
processors to run at the full memory bandwidth.
Okay? Here, you can think of the maximum number of memory references each
cycle; calculate that first, right? It's 32 processors times 6 references per
processor, so it would be 192.
Kim, Eun J:
Did you get the first term? 32. Each processor generates four loads and two
stores
every cycle.
Okay, so that is the total number of requests. Then the second term is the
ratio between the memory cycle time and the processor cycle time. So,
roughly, in one memory cycle, you have around
7 processor cycles, so you need to multiply by 7.
Then this is the total number of requests coming to
memory every memory cycle, okay? 1344, okay? So, ideally, if each request goes
to a different memory bank, you can sustain the requests coming in. Can you
see that? Yeah, this is the way we roughly estimate it.
Audio shared by Kim, Eun J
Then each SRAM bank is busy for 15 divided by 2.167, which is 6.92 clock
cycles.
Okay, so we round up to 7 clock cycles. Therefore, we require a minimum of
1344 memory banks.
So this is a really big number.
To handle multidimensional arrays in vector architectures, we need to
understand the stride concept.
The positions in memory of adjacent elements in a vector may not be
sequential, so let's look at this code.
We could vectorize the multiplication of each row of B with each column of D,
and strip-mine the inner loop with k as the index variable.
To do so, we must consider how to address adjacent elements in B and in D.
When an array is allocated in memory, it is linearized, right? It is mapped to
one dimension.
It must be laid out in row-major order on most machines, as in C. Note that in
Fortran it would be column-major, but in most cases we don't use Fortran, so
let's stick to row-major order.
Kim, Eun J:
Have you heard about Fortran?
Long, long ago, Fortran was kind of the default, and that is column-major,
okay? But nowadays, we don't see much column-major. C and most other languages
are row-major.
Audio shared by Kim, Eun J
This linearization means that either the elements in the row or the elements
in the column are not adjacent in memory. For example, in the
preceding C code, the code you are looking at, we have row-major order, so the
elements of D that are accessed by iterations of the loop are separated by the
row size times 8, right, because we assume 8 bytes, for a total of 800 bytes.
You should remember, in the chapter…
Kim, Eun J:
Do you get it? Look at D.
You are in the innermost loop; you
change k, right? So you are changing the first…
index. So, for example, when you first access element (0, 0), you next go to
(1, 0), right? And that is 800 bytes away, isn't it? Because there are 100
elements per row, and each one is 8 bytes.
Can you see that? So, for D, let's say you have an address starting at zero;
then the next address will be 800, and the next one after that is 1600. That
is the stride: how much jumping is going on, okay?
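The addressing described above can be computed directly. This is a sketch of
row-major address arithmetic assuming a matrix with 100 columns of 8-byte
doubles, as in the example; the function names are mine.

```c
#include <stddef.h>

/* Row-major linearization: element (row, col) of a matrix with ncols columns
 * and esize-byte elements lives at this byte offset from the base. */
size_t offset_bytes(size_t row, size_t col, size_t ncols, size_t esize) {
    return (row * ncols + col) * esize;
}

/* Stride for walking down a column: the byte distance between (k, j) and
 * (k+1, j), i.e., one whole row. */
size_t column_stride_bytes(size_t ncols, size_t esize) {
    return ncols * esize;   /* 100 * 8 = 800 bytes in the example */
}
```

So consecutive inner-loop accesses to D land at offsets 0, 800, 1600, and so
on, which is exactly the stride of 100 double words the narration computes.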
So, when people do prefetching… in this year's MICRO, we have two sessions
talking about prefetching, because
machine learning algorithms and Bitcoin and everything involve a lot of these
big matrices, and then we can see what the data strides are, right? What data
is required next time. So you can do beautiful
prefetching, okay? There are some ideas, like runahead: instead of executing
the whole thing, you can have only the for loop, a short one, and go through
only the load instructions.
Okay. Is there an alternative solution to this, of changing the layout of how
the addresses map to banks, instead of how you choose your accesses?
Okay, so when we talk about stride…
You can think of it this way: the way we interleave between memory banks is
fixed, okay? Let's say we don't control it, and this code is fixed, okay? It's
written by the user, it's given, okay? Then, in your architecture, what can
you do? If you predict this stride of 800,
right? You can have prefetch logic go through it, right? Okay.
Audio shared by Kim, Eun J
Back in memory hierarchy, we talked about blocking, right? Blocking can
improve locality in cache-based systems, okay? But we assume that in vector
processing we don't use much cache, so we need another technique to fetch
elements of a vector that are not adjacent, not sequential, in memory.
So, here, the distance separating elements to be gathered into a single vector
register is called the stride, okay?
In this example, matrix D has a stride of 100 double words, 800 bytes, and
matrix B would have a stride of 1 double word, 8 bytes.
For column-major order, which is used in Fortran, it would be different, okay?
I calculated it
based on row-major order.
Once a vector is loaded into a vector register, it acts as if it had logically
adjacent elements.
Kim, Eun J:
I don't think I can finish this, so I will come back to this slide next time.
Okay, so this is it. I will come back. All right, I have to run. Okay.
Anyone need to see me? I know… Maybe you can visit me on Friday, okay, during
class time. I will be in my office.
Sorry. She'll be in her office, yes.
Nov 17:
In the afternoon, so…
Yeah.
That would be nice.
Ice cream. So, what sort of flavors did they have?
Oh, that's an interesting… I got a goat cheese with honey. Yeah, but I couldn't
taste it. Really?
That's disappointing. And is it dark chocolate, or no?
Audio shared by Kim, Eun J
may not be sequential. So, 15 divided by 2.167 is 6.92 clock…
Good afternoon. Let's begin!
So, can you hear me on Zoom, those who are there?
For people on Zoom, you can hear me, right?
Go ahead. If you can, can you…
cycles.
Okay, so we will round that up.
Alright, so let's begin.
to 7 clock cycles.
in the middle of this set of slides, and I know some of you submitted the quiz.
But I changed the deadline to this week, so we can do it together.
So we will be finishing up the SIMD architecture this week, and then start CMP multiprocessors.
Any questions? How was your weekend? So, how are things going? So, are you done with homework 4?
So you submitted it?
When is the deadline?
Wow.
Okay, so the cache replacement policies, those are done already. So you started on your term project then? Okay, very good, very good.
And use GitHub, okay, do commits, as many as possible, so I try to give some extra points for that, okay? I don't want you to work on your term project only two days before the deadline, okay? So you guys, I want to see your GitHub: how it is used between your teammates, the communication, and you each commit, and you can help each other, so those will be used as evidence of your contribution, okay?
Told you.
And then homework 5 mainly shares material with the final exam, okay?
Okay, so we are marching to the end. I'm so happy.
Okay, and these topics are more interesting, so it'll be good.
All right, so you know the way we calculate this: how many banks, in theory, are required. Let's say we have 32 SIMD processors, and then each clock they have 4 loads and 2 stores. Why can that be? Because it's SIMD, you have multiple threads going on. And then there is a difference between the memory cycle and the CPU cycle, so it's roughly
7 times, right? 15 divided by 2.167 is 6.92, rounded up. So, which means while memory serves 1 request,
the CPU can generate 7 more,
right? And then each cycle, each processor has 6 references, so it's 32 times 6 times 7.
So this is the calculation. Each processor uses 6, and then this is 7 busy cycles, times 32. That is 1344 banks. So ideally, if you have all different banks,
each request goes to a different bank, then it can be served in how long?
This much time: 15 nanoseconds, you can serve all the requests, right?
It's the ideal case. And it won't be like that. Why?
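The slide's arithmetic can be sketched as follows (a minimal sketch, assuming the textbook's figures: 32 processors, 4 loads plus 2 stores per clock, a 2.167 ns processor cycle, and a 15 ns memory cycle, which match the 6.92 and 1344 above):

```python
import math

# Minimum-bank calculation (textbook figures assumed, not measured here).
processors = 32
refs_per_clock = 4 + 2            # loads + stores per processor per clock
cpu_cycle_ns = 2.167
mem_cycle_ns = 15.0

# While one bank is busy, the CPU issues this many clocks of requests.
busy_clocks = math.ceil(mem_cycle_ns / cpu_cycle_ns)   # 15 / 2.167 = 6.92 -> 7

min_banks = processors * refs_per_clock * busy_clocks  # 32 * 6 * 7
print(busy_clocks, min_banks)  # 7 1344
```

This is the ideal-case count: it only suffices if every outstanding request lands on a different bank.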
So let's say we… we provide multibank, we talked about multibank, right? We provide the multibank, 1344 different banks, okay? And then…
in one memory cycle, not in one CPU cycle, in one memory cycle, this many requests come in,
right? And then, what is the assumption we make? You have all different banks, and all these 1344 requests go to?
Different banks.
Is it possible? What will determine that?
You know, the distribution of these requests.
This is purely program behavior, right? Program behavior.
So if we have 1344 different banks, and it is all interleaved by bank, okay, at block level, then if the 1344 requests are all
sequential, can you see that? All sequential, if it is interleaved by block, per block, then
the 1344 requests will each go to a different bank, right? So they can all be served within 15 nanoseconds.
Okay.
There are things going on, okay, nowadays. Why is memory so important? This is just the simple GPU applications. Now, we talk about
the,
machine learning, right? So machine learning, recently… while I was preparing, tonight is a deadline, so tomorrow 6 AM is the deadline, so I was working on that paper before coming.
I'll do it differently.
It was an interesting
pattern we observed that I cannot share. Once this is published, I will share. But these are all different from, like, you know, my earlier work
on training. Okay, training: you have, let's say, 32 different accelerators or GPUs, and then you divide the data, the training data, by 32, and each one of the accelerators or GPUs trains with partial data, and then you do a gradient exchange.
That's a huge amount of data, so…
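The training setup just described can be sketched as a toy (the function names and the trivial "gradient" here are mine, purely for illustration, not from the lecture):

```python
# Toy sketch of data-parallel training: each of N workers computes a
# gradient on its own shard of the data, then all workers exchange and
# average the gradients. This exchange is the large, throughput-bound
# communication step described above.
def local_gradient(shard):
    # Stand-in for backprop on one worker's data shard.
    return sum(shard) / len(shard)

def all_reduce_mean(grads):
    # Stand-in for the gradient exchange (e.g. a ring all-reduce).
    return sum(grads) / len(grads)

data = list(range(32))                       # the whole training set
workers = 4
shards = [data[i::workers] for i in range(workers)]
grads = [local_gradient(s) for s in shards]  # computed in parallel
step = all_reduce_mean(grads)                # every worker ends with the same value
print(step)
```

The point of the sketch: the compute happens independently per worker, but the `all_reduce_mean` step moves data proportional to the model size across all workers, which is why the design is throughput-oriented.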
The design philosophy there: we want to have a throughput-oriented design.
Okay, we don't care how long it takes to deliver one result.
But we want,
per unit time, to process more, okay? So why am I talking about this? The GPU is more throughput-oriented. You have a lot of data coming, and the initial delay doesn't matter, because we can do pipelining, we can do,
you know, parallel processing.
However, the next chapter… these always contradict each other, okay? The CPU applications, like the SPEC benchmarks, or the PARSEC benchmarks we use a lot.
Those benchmarks are different.
There we found there are not many, you know, communications going on. Like, you have multiple CPUs, and then the next chapter is all about the cache coherence protocol.
Okay: so the data I need now is not in my local cache, it's in a remote cache, and I need to bring it over. That is a communication, okay?
But we found that in CPU applications, we don't have many such requests.
Okay? For a while, you have one load request.
And then for a while, you just compute with that, and then after 10 cycles, you have another load. Very different behavior. So, from now on, let's try to, you know, compare GPU, SIMD versus MIMD. The CPUs we have nowadays are multiple CPUs in a chip, right?
So there, you don't have many load instructions going to a remote cache. That's why, instead of throughput, you care more about what? So, let's go back to the first week of discussion. What did we discuss?
We are designers of a CPU, a processor, SIMD, GPU, whatever architecture, right?
What is our ultimate goal?
Reduced power is the ultimate goal.
Very good attitude. Sustainable computation, and Trump doesn't like that word. So,
okay, as computer architecture designers, what is our primary goal? The one we cannot compromise on.
Late… no, no, no. Latency is one part of that, right?
So, in Chapter 1,
what did I discuss? Okay, it's not comprehensive, but you need to have the big picture, right? The first chapter is important, although you feel like it's so easy, right? In the first chapter, what did we discuss?
Power comes later. What did we discuss before power?
Recall the questions you had in the homework and quiz. What did we discuss?
MTTF?
MTTF is the reliability, right? So, yeah, so this is the way I sell my research area, okay.
Power: when, you know, 20 years ago, people had a picture of an egg frying on the, you know, motherboard, so that is power, right? And then reliability, because of temperature. And yeah, then we had the Meltdown
and Spectre attacks, okay?
Did I talk about the Spectre attack? Oh, maybe during the memory hierarchy, I was so rushed. I will talk about it. So, they use
memory access time differences
to get the secret data from library functions, okay? Like, encoding, decoding, those are library functions you are using. So I will talk about it when the
multiprocessor discussion comes: cache coherence, things like that, okay? So security, all those, yeah, important, okay? So whenever I submit a proposal…
these change based on the trend, okay? Oh, security is really big in the national news, and then I will write a proposal. Okay, I want to design a system that is, number one…
Now you know, right? What is number one?
The one I cannot compromise.
As a computer architect, yes, I want to provide a highly reliable, energy-efficient, and, you know, secure processor: secure CPU, GPU. But the number one.
Performance, right?
Okay, security people, they only talk about security, but in our work, we want to provide a secure security measure and, at the same time, not degrade performance. We always put performance first, right? Okay, so performance. The SIMD performance goal
is more throughput. Do you remember when we talked about performance, there were two different metrics we discussed, right? What was it? Number one?
Latency? Latency, too?
Throughput, right? Throughput is the reciprocal of latency if you have only one system, but when you have a pipeline, when you have a parallel machine, that relationship is not there anymore, right?
So, for SIMD, throughput is more important, and for accelerator design, yes, throughput is more important, but then, you know, that is true when we have training, with a lot of data exchanged, okay?
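The point that pipelining breaks the reciprocal relationship between latency and throughput can be made concrete with toy numbers (illustrative only, not from the lecture):

```python
# Toy numbers: a 4-stage pipeline, 1 ns per stage.
stages, stage_ns = 4, 1.0

latency_ns = stages * stage_ns          # one item takes 4 ns start to finish
throughput_per_ns = 1.0 / stage_ns      # once full: 1 item per ns, not 1/4

def time_for(n_items):
    # First result after the full latency, then one result per stage time.
    return latency_ns + (n_items - 1) * stage_ns

print(latency_ns, throughput_per_ns, time_for(100))  # 4.0 1.0 103.0
```

With only one item (no overlap), throughput really is 1/latency; with 100 items, the pipeline finishes in 103 ns instead of 400 ns, so throughput is set by the stage time, not the end-to-end latency.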
Then… then you can guess what happens with inference, right? Okay, so then let me talk about the CPU, okay? For the CPU, the main…
the main communication bottleneck is the cache miss. What you learned in the memory hierarchy chapter is the case where we have only one CPU.
Then you have a miss in level 1, and then you go to level 2, level 3, right? But when we have multiple CPUs in a system,
there is a notion of local memory and remote memory, okay? Especially from level 2, we share the cache, okay?
So, when I don't have that cache block, I need to bring it from the other
side. Like, someone sent me an email asking whether slicing and multibank cache are the same terminology or not. Yes, it's the same terminology. Intel used that terminology. They said slicing. They do slice.
So if they have a 16-core CPU, then you will have a 16-bank,
multibank cache, so you will use some portion, 4 bits, of the address to determine where the data is, okay? It's a multibank.
So if the data is not in your local L2, L3, then you need to get it from them. That's what the next chapter is all about.
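The 16-slice example can be sketched like this (a toy sketch: which 4 bits select the slice is my assumption for illustration; real Intel slice hashes are more complicated than a simple bit field):

```python
# Picking the cache slice (bank) from address bits: with 16 slices,
# log2(16) = 4 address bits choose the slice. Here we take the bits
# just above a 64-byte block offset (an assumption, not Intel's hash).
BLOCK_OFFSET_BITS = 6     # 64-byte cache blocks
SLICES = 16               # 16 banks -> 4 slice-select bits

def slice_of(addr):
    return (addr >> BLOCK_OFFSET_BITS) & (SLICES - 1)

# Consecutive blocks land on consecutive slices:
print([slice_of(64 * i) for i in range(6)])  # [0, 1, 2, 3, 4, 5]
```

With this layout, sequential block addresses spread across all 16 slices, which is exactly the interleaving the lecture keeps coming back to.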
Okay? The thing is, the chance of a miss in my local cache is very low. Like, nowadays, the cache hit
rate is almost 95, like, 98 percent, okay? Maybe you already experienced it. What was your average hit rate when you did your homework 4?
What was the number?
You don't remember? It depends on the benchmark. Yeah, I know, but then if you get this kind of question, you need to give
a range or an average, right? Okay, so what was the range of values?
You don't recall? Okay, I'm not good with numbers either, okay? So if I have this kind of question during a discussion or, you know, an interview…
I can't… so I'm sharing my experience, okay? There are personal strengths and weaknesses. So although I… I like
conceptual math and theoretical math, I don't care about the raw numbers. That's why computer scientists and computer architecture people, what do we remember? We always remember
normalized values.
So I can understand, okay, that, for example, the population of College Station is X times that of Bryan. That stays with me. But then people talk about whether it is 5,000 or 6,000, or 5 million. I don't know, like…
More than 1,000 is just a big number for me. So, what's the ratio of hit rate you got?
What's the ballpark number?
About a 0.1% miss rate.
A 0.1% miss rate. Wow, that hit rate is very high. Like, one in a thousand accesses is a miss? Okay.
So, to prepare, okay: if you have a job interview or an internship interview, you will be talking about your homework 4, okay? Because SRRIP is a very
good paper, and many, many cache replacement policies have been compared against it. If you chose the topic of cache replacement for your term project, you…
already noticed that
recent work still compares with SRRIP, okay? So it hasn't gone out of date, right? And you will be talking about that during your interview, then…
Okay, the interviewer, someone like me,
who doesn't know the paper that well. So, the easiest questions are: oh, what was your hit rate?
How much improvement did you see? That's the kind of thing they will ask, okay? And
then: oh, I don't remember, I need to see the paper…
Hmm, okay. All right, so it's around… 1 out of 1,000 instructions is a miss. So 1 out of 1,000 means
0.1%. Okay, so then with SRRIP, how much improvement do you get?
That was the improvement you got. So what was the default, the baseline you
compared against?
You guys did your homework, right?
Oh my goodness!
I selected that very carefully. I know that if you talk about that work during your interview, people will like it, okay?
You should believe me.
And you, you will screw up, like, by not answering these simple questions.
So you weren't curious about that?
Isn't it?
So, did you summarize
the numbers like this: against the baseline, how much, you know, improvement do you see?
Did you calculate it?
Or did you just submit it and forget?
So is it more than 10% or 20%?
The before and after were extremely close, except on, like, two benchmarks where SRRIP was much better. So they both had, on average, like, a 0.1% miss rate.
Okay, okay, you don't see… okay, very good, very good. Very interesting, right?
Interesting behavior you observed.
So, on average, you don't see much benefit from RRIP, and there are only two
benchmarks where you see some benefit. Why?
People will ask!
I mean, the point of SRRIP was that sometimes we have non-cacheable
accesses, and we don't want those to flush out the actual working set. So, presumably, streamcluster has streaming accesses, which is one of the benchmarks…
Streaming accesses: what do you mean by that?
When you have a series of accesses that are not repeated for a significant period of time, caching them does not help, because there's no local working set.
So, for your final: how SRRIP works with the tag, like, some bits, right? There are two bits they manipulate. It will be on the final exam.
Okay, I didn't have a quiz on that, but I will give you the same kind of sequence of memory references, and then I will ask: if you use SRRIP, how do those bits change, and what are the final contents of your cache? That will be on your final exam. It's a one-off question, okay?
Since you did it, you… you really want to, you know, master the concept, okay, and why it works on particular benchmarks. If you didn't
give a proper explanation in the report, you will get a huge deduction, okay? These are, you know, the core of computer architecture work. We… we did some design, and then analyzed the data we have, okay?
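Since the exam will ask how the two bits change for a given reference sequence, here is a minimal sketch of one SRRIP set (my simplification of the paper's hit-priority variant; the 4-way set size and the tag names are illustrative):

```python
# One SRRIP set with 2-bit RRPVs (the "two bits they manipulate"):
# a hit resets the block's RRPV to 0, a new block inserts at RRPV = 2,
# and the victim is a block with RRPV = 3 (aging everyone until one exists).
RRPV_MAX = 3          # 2 bits
WAYS = 4

blocks = {}           # tag -> rrpv, at most WAYS entries

def access(tag):
    if tag in blocks:                 # hit: promote to near re-reference
        blocks[tag] = 0
        return "hit"
    if len(blocks) == WAYS:           # miss in a full set: pick a victim
        while not any(v == RRPV_MAX for v in blocks.values()):
            for t in blocks:          # age everybody
                blocks[t] += 1
        victim = next(t for t, v in blocks.items() if v == RRPV_MAX)
        del blocks[victim]
    blocks[tag] = RRPV_MAX - 1        # insert with a "long re-reference" guess
    return "miss"

for tag in ["A", "B", "A", "C", "D", "E", "A"]:
    print(tag, access(tag), dict(blocks))
```

Tracing this by hand is exactly the exam exercise: the reused block A keeps getting reset to 0 and survives, while never-reused blocks age out first.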
Basically applying the aging… Aging, yeah, it's aging.
So you compared against LRU, right?
Okay, so LRU mostly works well. Then, if you have similar performance, what is the benefit of SRRIP?
So you… think about it. For LRU, you have a simulator; it's not a real hardware design, is it?
Yes, I was going to say… Yeah, yeah, yeah. Okay. Do you… do you see? Implementing perfect LRU is so expensive in hardware, because you would need an unbounded counter.
We don't have an unbounded counter. In software, we pretend that we have one, right? We have a variable, so that we can compare which one is the least recently used. In hardware, no, right? SRRIP, actually,
is mimicking LRU behavior with just two bits, right?
And there are still some benchmarks where it works better, okay?
Anyway, so we are dealing with a 1%… a 0.1% miss rate, okay? But did you implement SRRIP in level 1, or level 2, or the last level? What was your configuration? Level 3, okay.
Yeah. At level 3, of course, you will have a very low miss rate because of the cache size. You have a very big cache, right?
So, based on that number, if we move to the sliced cache architecture, let's say 16 banks, then the probability that you miss a block in your local L2 slice,
but it exists in one of the 16 other banks, is very high, okay? So that's the thing, but still, let's say it's 3%:
out of 100 cycles, you have 3 misses going on, 3 transactions going on. So latency is
important.
And why is it important to improve? Like, let's say before it was 7 cycles, and now you reduce it to 5 cycles, or 2 cycles. It is a big deal. Why?
Then you need to go back to Tomasulo and hardware speculation. What was it? When you draw the dependency graph, what is the first instruction you always have at
the root? A load, okay? So, handling loads…
In CMP applications, CPU applications, you have very rare misses going to the remote cache.
So, for me, it's the communication: like, when we have a, you know, mesh network and the 16 cores interconnected, you have very rare things, okay, transactions going on. But delivering them quickly
is very critical for performance. Whereas on the GPU, you will have a lot of misses, a lot of traffic going on, but it is okay for one to take long, because
anything coming later is pipelined; that's okay, because you need a lot of data transactions going on, okay?
So that's the big difference. So: CPU, latency-oriented design. When you design memory, everything is latency-oriented. The GPU is throughput-oriented, okay? It's a different paradigm we have.
Yeah, I spent too much time talking about it, but that's a very important thing for your, you know, future.
Therefore, we require a minimum of 1344 memory banks.
So this is a really big number.
To handle multidimensional arrays in vector architectures, we need to understand the stride concept.
The positions in memory of adjacent elements in a vector may not be sequential, so let's look at this code.
We could vectorize the multiplication of each row of B with each column of D, and strip-mine the inner loop with k as the index variable.
To do so, we must consider how to address adjacent elements in B and in D.
When an array is allocated in memory,
it is linearized, right? So it is transformed to one dimension. It
must be laid out in row-major order on most machines, like in C. Note that Fortran uses column-major order, but in most cases we don't use Fortran, so let's stick to row-major order.
This linearization means that either the elements in the row or the elements in the column are not adjacent in memory. For example, in the
preceding C code, the code you are looking at, with row-major order,
the elements of D that are accessed by iterations of the loop are separated by
the row size times 8, right, because we assume 8 bytes, for a total of 800 bytes.
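The loop being narrated is, roughly, the textbook's A = B × D kernel. The sketch below (Python standing in for the C code, with assumed 100×100 matrices and 8-byte elements) just computes the flattened row-major addresses to show where the two strides come from:

```python
# Row-major address arithmetic for A[i][j] += B[i][k] * D[k][j]
# (100x100 matrices and 8-byte doubles assumed, as on the slide).
N = 8                      # element size in bytes (one double word)
ROWS = COLS = 100

def flat_addr(row, col):   # row-major linearization
    return (row * COLS + col) * N

i, j = 0, 0
# Inner loop over k touches B[i][k] and D[k][j]:
b_addrs = [flat_addr(i, k) for k in range(3)]   # B[i][k], k varies
d_addrs = [flat_addr(k, j) for k in range(3)]   # D[k][j], k varies
print(b_addrs)   # [0, 8, 16]     -> stride of 8 bytes (1 double word)
print(d_addrs)   # [0, 800, 1600] -> stride of 800 bytes (100 double words)
```

The B accesses walk along a row (adjacent in memory), while the D accesses walk down a column, jumping a whole row (800 bytes) each time.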
You should remember: in Chapter 2, memory hierarchy, we talked about blocking, right?
So, blocking can improve locality in cache-based systems, okay? But we assume that in a
vector processor, we don't use much of a cache, so we need to have another
technique to fetch elements of a vector that are not adjacent, sequential, in
memory, okay?
So, here, the distance separating elements to be gathered into a single vector register
is called the stride, okay?
In this example, matrix D has a stride of 100 double words, 800 bytes, and matrix B
has a stride of 1 double word, 8 bytes.
For column-major order, which is used in Fortran, it will be different, okay? So I
calculated it…
row-major based.
Once a vector is loaded into a vector register, it acts…
So in B, if you look at B, k changes first, right? So once you get B[0][0], then you go
to B[0][1]. So the stride is the distance between two consecutive accesses for the
same vector. So it's 8 bytes, right?
How about D?
k changes first, so from [0][0] it goes to [1][0], and then [2][0]. So what's the distance
between [0][0] and [1][0]? How many elements do you have in the middle? 100, right? A whole
row. So 800 bytes. That's the stride.
Okay?
So, we use this a lot: we try to…
predict the stride. So, when you run a couple of iterations of this, you can
easily get it, right? Or at compile time, you can get the stride information. So if we
want to prefetch, and we know the stride, then we… we can have that load done much quicker,
right? Run ahead, something like that. So this is very…
critical information to understand.
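The "run a couple of iterations and you can easily get it" idea can be sketched as a toy stride detector (the function names are mine; real hardware prefetchers track this per load instruction, which this sketch omits):

```python
# Toy stride predictor: watch consecutive miss addresses from one load,
# and once the same delta repeats, prefetch the next address early.
def detect_stride(addrs):
    deltas = [b - a for a, b in zip(addrs, addrs[1:])]
    if len(deltas) >= 2 and deltas[-1] == deltas[-2]:
        return deltas[-1]             # stride confirmed by repetition
    return None

seen = [0, 800, 1600]                 # e.g. the D[k][j] accesses above
stride = detect_stride(seen)
if stride is not None:
    prefetch = seen[-1] + stride      # issue this load ahead of time
    print("prefetch", prefetch)
```

Two matching deltas are enough here to "confirm" the stride; a compiler can often hand the same information over statically, as the lecture notes.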
As if it had logically adjacent elements.
Thus, a vector processor can handle strides greater than one, called non-unit
strides, using only vector load and vector store operations with a stride
capability.
This ability to access non-sequential memory locations and to reshape them into a
dense structure is one of the major advantages of a vector architecture.
Supporting strides greater than one complicates the memory system. Once we introduce
non-unit strides, it becomes possible to request accesses from the same bank
frequently. So here, when multiple accesses contend for a bank, the same bank,
a memory bank conflict occurs.
So, we need to stall, right? So, a bank conflict occurs this way: when the number of
banks divided by the least common multiple of the stride and the number of banks
is smaller than the bank busy time. So, you see, you should have a big enough number
of banks, considering the stride and the number of banks.
You will have an example in the quiz.
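The condition just quoted can be sketched as code. I read it as comparing the revisit interval, LCM(stride, banks)/stride, which equals banks/gcd(stride, banks), against the bank busy time (my reading of the slide, so treat it as a sketch):

```python
from math import gcd

# Bank-conflict check: with one request per cycle at a fixed stride,
# you come back to the same bank every banks/gcd(stride, banks) cycles;
# a conflict occurs when that interval is shorter than the busy time.
def has_conflict(stride, banks, busy_time):
    revisit = banks // gcd(stride, banks)   # cycles between hits on one bank
    return revisit < busy_time

# The quiz configuration: 8 banks, busy time 6.
print(has_conflict(1, 8, 6))    # False: stride 1 revisits every 8 cycles
print(has_conflict(32, 8, 6))   # True: stride 32 hits bank 0 every cycle
```

A stride of 31 with 8 banks also comes out conflict-free, since gcd(31, 8) = 1, which matches the end-of-class example below.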
So let's practice tracing. Okay.
So, okay, here, you can think of it this way. So…
Bank busy time: once you have an access, the bank will be busy for that time. And then the
interval until the next request comes to the same bank is this.
Okay? So if this interval is short, when you arrive, the bank is still busy, and you
have to wait longer, okay? If you arrive after that,
you don't have to wait. That's a bank conflict, right? So, this is the number of
banks, and then the LCM, the least common multiple, of the stride and the number of banks: how you are skipping,
and how you come back. It is the rate of coming
back to the same bank.
Okay, so let's stop this, and then do the quiz together.
See?
Some of you have already done it, right? So you can volunteer, then.
Okay.
So here, you have 8 different memory banks.
And then, the initial delay is… 12.
And then the busy time, when you have… after, you can imagine.
So at the beginning, it's 12, and the bank busy time is 6. So once… when
you come back for the next request, if the bank is busy, you need to wait 6 more cycles
for your service, okay?
So there are two cases.
Stride 1. So, from here, the hint is: with stride 1, the SIMD
processor will generate memory loads every cycle, okay? Every cycle.
And the 12 and the 6 are CPU cycles. You don't have to convert again. So we…
we are dealing with the CPU world.
And, so…
So, when you have the first request, then you… you will have stride-1
consecutive requests. For how many? Your… your elements: 64, okay? So, with stride
1, first it will go to bank 0, then it will go to bank…
1, 2, 3, and then it comes back to 0 after
7 more. Can you see that? After 7 means your busy time of 6 has passed, so you can avoid that busy wait.
Do you know what I mean? Okay?
Alright.
So, calculate it.
Oh, I'll take this one. So, your memory is, let's say, cold at the
beginning. So, the first access will take 12 cycles. In the meantime, you have the next request
come into the queue, right? Then it's an additional 6 for the next request.
So it won't be… so only the first one takes 12; then for the next, and next, and next,
because your memory already has the row buffer or whatever set up, you… you
just need 6 more additional cycles for the next element.
Okay?
Is it clear?
And then the stride of 32. The idea with 32, look at that: the LCM of the stride and the bank
number, right? Here the stride is a multiple of 8, which means you go to bank 0, and
the next request goes to where?
Bank zero again. And bank zero again. Bank zero again. Every cycle, right? Okay.
Whereas in the first case, you go back there every 8th cycle.
Right?
Isn't it?
Because of the way you do interleaving.
Do you remember when we did cache banks?
Okay.
So, simple thing: if you have four memory banks, what do we do?
In the memory address,
the last two bits determine which bank you need to go to.
So if the accesses are consecutive, one goes here, then the next goes here, then here, like that.
But then, if your stride is, let's say, 32, then those bits are always 0, right?
Okay, it goes back to the first bank all the time.
It doesn't say how much time it takes for… so do we… we assume that all 64
are issued at the first cycle? No, no, no. I assume, I assume: the 64 elements, from the
CPU side, the processor side, are generated one by one, because
you run the program, right? You have a PC: load the first one, and then
a second one, a third one, like that. Like the way we did, you know, chimes and
convoys, do you remember? We did all 64 elements, but it is pipelined there,
right?
So, like, if there are 8 lanes, we can access 8 loads at the same time, right?
Yes, yes.
But, but, okay… that's what I meant here. Let's say bank 0, bank 1,
bank 2, and so on, okay? So let's say stride 1. The first request comes here; it will take 12,
okay?
And the next one will go to bank 1 after one cycle.
The generation of the requests is one per cycle. It's pipelined. You don't, you
know, initiate them all together.
So this will be 12.
12, and so on, okay? Dot dot dot. So then it will be 12.
Okay. And then this one has the 6, right? 6 here. And then the next one, the 9th:
coming back here, the bank is already busy, so you will use this 6.
Isn't the 12 included in the total number?
No. So when you come back here, it's still busy, right?
So you need an additional 6, waiting for your element to be done.
So for 64, if you divide by 8, it will be modulo, right? So, how many per bank? 8. So you
will have a 12, and then, if I roughly calculate, 6 multiplied by the rest for each bank, isn't it?
This will be it, and then there is the idle offset at the beginning for the last bank.
That gives the total latency of the first case, stride one.
How many… so, okay, so you have requests 1, 2, 3, 4… 8, right? The 9th request goes back
here.
Okay, so you will have a 6.
And then at 8… and at 16, you come back here again, and so
on. So how many 6s will you have?
7, okay? So it'll be 12 plus 7 times 6.
Okay, so this is from 12, for bank 0. But then the longest one actually is bank
7, isn't it? At the beginning, its cycles are idle.
So you… you need to add 8 here. So the total will be 12 plus 7 times 6, plus 8.
This is the… the latency.
So try the… try the stride of 32. It's even simpler.
I didn't get the 8 times… Okay, okay.
Maybe, like for undergraduates, I should have 8 people lined up, you know.
It's volunteering. So let's say we have all the banks, okay, the 8 banks. Your… your SIMD
processor throws requests one by one,
every cycle, as, you know, each one comes. So this will go at time zero.
And she will take 12 cycles for this request. Meantime, next cycle, the second one
goes here.
Because it's stride one, it always goes to the neighboring bank. So then the third.
Fourth, okay? Fifth… and eventually it goes back here.
Okay? Back here. So this is it. So, when you look at the first bank:
it is busy, right? And then when the second one comes back, it's still busy. You wait
until the first one is done at 12, and then for you, it will take 6 additional cycles
only… so the second element takes only 6 more cycles.
The first one is cold memory: you have the row and column activation, you know, things to arrange, so
this is the way.
Okay?
Like, why 12 plus 7 times 6?
Okay, alright, alright. So, how many requests per bank will you have?
Your total is 64, right?
And you believe it's interleaved by 8, right? Modulo 8, then how many? Each
bank: 8, right? So, you will have 8.
The first one takes 12, then the other 7 take only 6 more each. That's why it is the
6.
7… 7 multiplied by 8. Does it make sense? Isn't it 7 multiplied by… Oh, yeah, yeah, yeah,
oh, sorry, wake up.
Because I was busy, right… okay, okay, this is it. Alright.
6, right, right. So this is a… so, Brian, you're right.
Sorry!
So, to be precise, it will be, like, 7 times 6, and then the 8, okay?
I know. Okay.
How about the next case? Is it easier? Where does the 8 come from, like, after the…
So, okay… let's count, in this scenario, the requests.
Okay, you count your delay. So she will have a 12.
12. But when she… it's 1 plus 12, 2 plus 12, right? 3 plus 12. Can you see that? So actually, it's not 8, it's
7, precisely.
Because at time zero, it goes here.
The second one is at 1; the third one at 2, okay? And bank 7, it will be at 7, okay? And from
7, it will take 12, and then you will have
7 more 6s. So, yeah, correct: 7 plus 12 plus 7 times 6, okay.
So, thank you! This should be final. No more corrections!
Thanks.
You got it? It's pipelined! The requests go out at 0, 1, 2, 3…
And to banks 0, 1, 2, 3, right?
The 8th access, right. It will… it will wait until the 12 finishes, and then your busy time
means: since the bank is busy, you will have an additional 6 for the second one. It completes
with an additional 6. Yes.
Yes.
But it starts only at 8, right?
The total memory access for one load is 12. When it starts at 8, it will complete with
an extra 6: in total, still done at 18, fully completed. The access that comes back to bank 0
starts at cycle 8.
So let's calculate. It'll be okay: 12 plus 6 is 18. How about the third one, when it comes?
The third one, when it comes: 16, right? The third one, when it comes back.
Thank you.
So, 12…
plus 6… So, 12 plus 6 is 18. This is the second one, but then the arrival time
for the second request is 8, right? 8. But then, still, the bank is busy, so you wait,
right?
And then you… you serve this next one. But how about… how about the third element to
bank zero? It arrives at 16, right? The same thing: when it comes, it's still busy.
At time 16, can you issue a request? Oh, no, no, no, no. Okay, so…
Okay, so this is how we interpret it from the textbook. Only… this interpretation makes
sense with the numbers. How do I interpret it? So 12 is the initial delay of the memory,
okay? So, if you arrive in the middle of this busy time,
the next request will take only an additional 6 more. So, 12 is required for the
first request, and the second request an additional 6, and the third an additional 6, like
that. Yeah.
So when you arrive and it's busy, you will have… you will wait until the earlier
one is done,
but then an additional 6 cycles are required. Why does this number change? I interpret it this way:
you have a row buffer, and then the initial row and column activation delay.
So usually memory access time is decomposed into this initial memory access time
versus the second access time, and the second is much lower than the first
access time.
Because you may… you have a chance of a hit on the row buffer, or at least
you don't need to re-initiate the row and then column strobe. Those times are saved.
So this is a rough estimation. It's not an exact number. Yeah, that's 7 times 6,
right? Yeah. Whatever, okay. I don't care. How about the second case? Let's contrast it
with this one.
I'm not clear. Is our assumption that every cycle we issue one new request? Yes, every
cycle. So, the… the first cycle we issue to the first bank, second cycle, second
bank, a stride of one, and when we're issuing the eighth… request 8,
it goes back to bank zero.
Correct. Yeah, yeah: if you count from zero… so, not first, just request zero, request
one, request… yeah, and then it's a…
The 12 is still occurring. Occurring, so you wait until it's done, and then the
second request to the same bank zero will take only an additional 6 more. Okay.
In the meantime, eight cycles later, the next request comes to bank zero again, but then it
will wait until the second request is done; the third one will be served after it,
with an additional six cycles.
So, this is, you know, the thing you want to take away from this quiz.
Your requests with stride one will be interleaved, modulo, so each bank will
have only how many? Eight requests coming. And these all arrive when the bank
is still busy, so each will take only an additional 6 cycles, okay?
Alright.
How about stride 32?
Yeah, so then what is the number? So your initial first one is 12, and then 12
plus…
6 multiplied by 63, isn't it?
I should be quiet. No, I will wait until you get it. Oh, how many? How many
elements? 64, right?
So you need to multiply by… 63, if you follow my interpretation.
12 is the first request's service time; 6 is because when the next request comes, the bank is still
busy, so an additional 6 more.
So, if all 64 elements go to the same bank, the first one takes 12, and the later
63 requests take only 6 cycles each. That's how I interpret it.
The answer can differ slightly from the textbook's, but it doesn't make sense
otherwise, okay? So is the 12 per bank, or for the memory entirely?
The 12-cycle latency is… per bank. Yeah.
So this quiz highlighted the stride concept.
If the stride is 1, every request will go, modulo the bank count, to a different
bank, right? Whereas if your stride is a multiple of the number of banks, it always
goes back to the same bank. We call that a bank conflict.
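To make the mapping concrete, here is a minimal sketch (my own illustration, not from the slides), assuming 8 banks interleaved at element granularity, so request k with stride s lands in bank (k·s) mod 8:

```python
# Sketch: which bank each request hits, assuming 8 interleaved banks,
# so request k with stride s touches element k*s and lands in bank (k*s) % 8.
NUM_BANKS = 8

def bank_sequence(stride, num_requests=64):
    return [(k * stride) % NUM_BANKS for k in range(num_requests)]

# Stride 1 cycles through every bank; stride 32 (a multiple of 8) always
# hits bank 0; stride 31 (coprime to 8) still visits all banks.
print(bank_sequence(1)[:8])    # [0, 1, 2, 3, 4, 5, 6, 7]
print(bank_sequence(32)[:4])   # [0, 0, 0, 0]
print(bank_sequence(31)[:4])   # [0, 7, 6, 5]
```

This matches the in-class walk-through below: with stride 31, the second request goes to bank 7, the third to bank 6, and so on.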
Okay, so in GPU and machine-learning applications, bank conflict is a very
important concept you want to avoid.
Okay.
This'll be the second. It'll be done on the end of 18.
Isn't it just, like, 3 times where they all go to zero at… You're awful.
Okay, so tell me, where does the next one go? The stride is 32.
Where does the second request go? Okay, very good.
And then how about the third one? Yes, so all 64 go to bank zero, isn't it? Okay.
Where do we use the 32?
Oh, because of the 32… so let's say I change it to 31, okay, let me change the
stride to 31.
Where does the second request go, after the first one goes to bank zero?
The 8th bank? It's modulo 8, and 31 mod 8 is…
7, so you go to bank 7, right? Yeah. You got it?
So she asked — in this exercise, okay, let me write it down.
With the stride of 32, will it be… B0, B1, B2…?
No — it always goes back to bank zero:
32 is 4 times 8, right?
The remainder is zero, so it still goes to bank 0.
And then the next one goes there too. All 64 elements go to bank zero only.
Okay, so it's all serialized, so you will have 12, and then 6 multiplied by 63.
That's the time. You can compare it against this one, okay? And then she asked,
where do we use the 32? If I change this to 31, what happens?
Your first request goes to bank zero, and then the second: if you add 31, modulo 8
it goes to bank 7, isn't it?
Then how about the next one?
Anyone?
B1?
No — it's a modulo function, right? Every time, from where you are, you advance by
31 mod 8, which is 7: so 1, 2, 3, 4, 5, 6, 7 — it goes to B6.
The third one goes here.
And then the fourth… yeah, yeah.
And so on, okay? You can see that the pattern of requests arriving at each bank is
different when you have a different stride.
Okay?
So this is an exercise question from the textbook, but my solution is a little bit
different, because the way I interpret the busy time and the initial latency is
different.
Any questions? Okay. For the second one, why is it 6 times 63 and not 64, if the
first one was, like, 6 times 8?
The first one is 6 times 8?
Yeah. So, this comes from here, okay? You have a 12…
And then 6… how many 6s do you have? Actually, 7. You have a total of eight
requests, okay? And do you agree bank zero finishes serving these 8 different
requests first?
Which bank is the slowest one?
B7. Because B7 gets its first request only after 7 cycles, isn't it?
That's why I added the 7.
Whoa.
Okay.
Okay. So far we've only talked about the memory access time incurred on the memory
side, right?
What initiates these requests?
The processor, right? In the processor, you have
64 elements, so you have
a vector load, LV, with a base address — whatever the vector address is, right?
This is handled in the pipeline.
We send the first request, then the second request, one after another, okay?
We don't send them simultaneously. You can't, because you have a serialized bus!
Even for the second one, shouldn't we also be having a second…? No, no, we assume
your processor-side generation of these requests is one every cycle, okay?
Yeah, that requires some assumptions.
As long as you get the stride part correctly, you get the right idea. Find this
question in the textbook and you can read more detail, but this is how I interpret
it: the processor side sends a request every cycle, consecutively, and the main
point is that with stride one, the requests are all distributed to different banks.
But stride 32 is a multiple of 8, so it always goes to the same bank. That's why
it's all serialized delay, okay?
Alright.
Okay, so let me clear up.
The next one.
Thank you.
So, okay. How does the processor issue them?
It's the way we program it — compare with the earlier version.
In the earlier one, you have a for loop, i equals 0 to 64, and inside you have a
scalar load, okay? You do this 64 times.
Now, with the vector processor, we don't do this; we have LV. Okay, we have this
one instruction.
Inside it, the implementation is similar to the individual loads, but you send the
starting address, okay? And then it will do a scatter or gather: it sends the 64
requests to memory, one by one, back to back, then you get the data and load it
into the vector registers. Does that answer your question?
It's 12?
-Oh.
It explored exactly what it is.
So that's what we did just before, okay? So if your
64 elements have stride one —
nearby each other — then you use all 8 banks,
one by one, concurrently. But if the stride is 32, a multiple of 8, it always goes
to the same bank. You cannot parallelize it. Even though you have 8 different
banks, it's serialized, okay? That's all about the stride concept.
So the compiler tries to do something to avoid this bank conflict happening all the
time, okay?
All right.
So let's go to the next set of slides.
To recap: the total memory latency is how long it takes to read a value, and 6 is
how long the bank is tied up before it can take the next request.
Okay, so we are done with last week's material. Now… let me do that.
Hmm — why can't I change this at all? What happened?
Are you able to see this module?
Why don't I have any editing?
Stop.
Am I in student mode?
Anyway, let me go on, and I will fix it after class, okay? I'm sorry — oh, online
students cannot see it. You can hear it from me.
All right, so we will finish this up, and then you will have one more week to read
chapter 4, okay? So let's talk about the Intel machines.
You can relax a little bit, okay? But there is one graph
you want to remember for your lifetime. How do you remember it? If you understand
it, you won't forget, okay? So you can be relaxed for the other slides, but when I
say attention, you really need to understand that, okay? Because…
if I were interviewing you — if I were working at AMD nowadays — everything is
throughput: computation power versus memory bandwidth, and understanding this
roofline graph is very important, okay?
I never put that question on the final, but maybe I should.
Audio shared by Kim, Eun J
Oh, with this setup.
The more important topic: SIMD extensions for multimedia applications.
SIMD multimedia extensions started with the simple observation that many media
applications operate on narrower data types than the 32-bit processors were
optimized for.
Graphics systems would use 8 bits to represent each color, plus 8 bits for
transparency. So, by partitioning the carry chains within, for example, a 256-bit
adder,
a processor could perform 32 simultaneous operations on 8-bit operands — so we can
parallelize many small operations.
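As an aside, the partitioned-carry-chain idea can be sketched in software too. This is a hypothetical SWAR-style illustration (my own, not from the lecture): independent 8-bit adds inside one wide integer, with carries blocked at lane boundaries the way SIMD hardware does it in silicon.

```python
# Sketch of partitioned carry chains: perform independent 8-bit adds inside
# one wide integer by masking off each lane's top bit so carries cannot
# propagate across lane boundaries.
def packed_add8(a, b, lanes):
    width = 8 * lanes
    low = int("7f" * lanes, 16)    # 0x7f repeated: all bits except lane MSBs
    high = int("80" * lanes, 16)   # 0x80 repeated: only the lane MSBs
    # Add the low 7 bits of every lane, then fix up each MSB with XOR.
    return (((a & low) + (b & low)) ^ ((a ^ b) & high)) & ((1 << width) - 1)

def pack(byte_values):   # little-endian packing of 8-bit lanes into one int
    return int.from_bytes(bytes(byte_values), "little")

def unpack(x, lanes):
    return list(x.to_bytes(lanes, "little"))

# Each lane wraps modulo 256 independently: 200+100 = 44 (mod 256), 100+50 = 150.
print(unpack(packed_add8(pack([200, 100]), pack([100, 50]), 2), 2))  # [44, 150]
```

The key point is that a single wide adder plus two masks does many narrow adds at once, which is exactly what the hardware partitioning buys.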
Unlike vector machines, These SIMD instructions tend to specify fewer operands and
thus use much smaller register files.
In contrast to vector architectures, which offer an elegant instruction set
intended to be the target of a vectorizing compiler, SIMD extensions have three
major omissions.
First, no vector length register.
Multimedia SIMD extensions fix the number of data operands in the opcode, which
has led to the addition of hundreds of instructions in the Intel x86 series in MMX,
SSE, and AVX.
Vector architectures have a vector length register that specifies the number of
operands for the current operation.
These variable-length vector registers easily accommodate programs that naturally
have shorter vectors than the maximum size the architecture supports. Moreover,
vector architectures have an implicit maximum vector length in the architecture,
which, combined with the vector length register, avoids the need for manual
strip-mining loops.
Second.
There are none of the sophisticated addressing modes of vector architectures, such
as strided accesses and gather-scatter accesses. These features increase the number
of programs that a vector compiler can successfully vectorize.
Third.
Although this is changing, multimedia SIMD extensions usually did not offer the
mask registers to support conditional execution of the elements, as in vector
architecture.
I never ask your memorization.
Okay.
So, I would remember, if I were you, this set of slides — the SIMD extensions slide
set, okay?
If you have an interview with Intel, you want to review this before interview day,
okay? So you can compare
the Intel SIMD series versus GPU and vector architectures. These are the three.
I'm not testing you on this memorization, but if I were you, I would refresh this
before an interview with Intel, okay?
So, do you recall what is a mask register?
Right?
It's a register which disables a lane — it does predication. Yeah, we use
predication, right, so that we won't have divergence in each thread's execution.
So, Intel doesn't do that, okay? They don't do it, and I feel this is the biggest
difference, okay?
And then there are the complicated addressing modes.
Ironically, if you learn x86 Intel assembly, they provide the most complicated
addressing modes, unlike RISC architectures. But when it comes to vector
operations, they don't want to provide things like striding and scatter-gather.
Okay. Yeah, they just tried to fit vector processing into their existing
architecture. That's why a lot of good features got dropped, and that became the
bottleneck of their design.
Yeah, this is just the first.
For the x86 architecture, the MMX instructions added in 1996 repurposed 64-bit
floating-point registers so the basic instructions could perform 8 of 8-bit
operations, or 4 of 16-bit operations simultaneously.
Note that MMX reused the floating-point data transfer instructions to access
memory.
The streaming SIMD extensions, SSE — the successor, in 1999 — added separate
registers that were 128 bits wide.
So now instructions could simultaneously perform 16 8-bit operations, or 8 16-bit
operations, or 4 32-bit operations.
Intel soon added double-precision SIMD floating-point data types via SSE2 in 2001,
SSE3 in 2004, and SSE4 in 2007.
With each generation, they also added ad hoc instructions whose aim was to
accelerate specific multimedia functions perceived to be important.
So, they didn't have a big picture.
Can you see that? Whenever you submit a paper — or if I review your term project —
and your idea or your solution sounds ad hoc,
that's really bad. Okay, because it only works for now.
Right? You really need to think about the real, main cause of the problem, right,
and where it is going. You shouldn't provide just a time-to-time solution.
Intel did it that way.
Why didn't they — what stopped them, after the initial MMX or any of these that
followed — from adding the things they were missing? Even now? Even now — well,
even now, they lost the battle against NVIDIA, right?
Even recently, they keep this SIMD division. I think they still hope they can use
their own customized one, but we users don't build systems that way. Even though
Intel provides small-scale SIMD vector processing, what do we do? We buy a GPU and
put CPU and GPU together, right? That's what users do.
And so they lost the market.
And then, you know, the NVIDIA GPU…
The biography of Jensen is so touching, right? Did you see his biography, his life
story?
Oh, you should! It's a very fun story. Anyway…
He had the big picture, and then he was lucky, right? He came up with this
massively parallel architecture for the GPU, and then Bitcoin came, and when AI
came, you know, it worked well. But…
They did well with the CUDA library.
They provided the CUDA library — it was early, 2005 or so — and people started to
use the GPU even for general-purpose programming. So that's the main reason, and
then Bitcoin and these kinds of things came along.
So there are lucky factors too, but a lot of the time, luck won't come by itself.
A lot of trials and, you know, braveness are required when you do this. And, again:
if someone says your solution is ad hoc, it's the worst comment you can get as a
computer architecture person, okay?
So you should try to have the big picture: what's the real problem, okay?
That's what I felt. I feel sorry for Intel, but…
that's what happened. But, you know, Intel is still doing well, right? In the CPU
market Intel is still number one, so if you have a chance to interview with Intel,
just review these slides, and then the next ones, okay?
They will be good.
The Advanced Vector Extensions, AVX, added in 2010, doubled the width of the
registers again to 256 bits
and offered instructions that double the number of operations on all narrower data
types.
AVX512 in 2017 doubled the width again to 512 bits, doubled the number of registers
again to 32, and added about 250 new instructions, including scatter and mask
registers.
In general, the goal of these extensions has been to accelerate carefully written
libraries rather than to serve as targets for compiler-generated code.
To get an idea of what multimedia instructions look like, assume we added a 256-bit
SIMD multimedia instruction extension to RISC-V.
We call it RVP, for packed.
We concentrate on floating point in this example. We add a suffix 4D on
instructions that operate on four double-precision operands at once.
Like a vector architecture, you can think of the SIMD processor as having lanes —
4 in this case.
RVP expands the F registers to be full width, in this case 256 bits.
This example shows the DAXPY loop, with the changes to the RISC-V code for SIMD
underlined.
Same code you had earlier.
The changes were replacing every RISC-V double-precision instruction with its 4D
equivalent, increasing the increment from 8 to 32, and adding the SPLAT instruction
here,
which makes 4 copies of the value a in the 256 bits of F0.
However, it doesn't provide a speedup quite as dramatic as the vector processor's;
it gets almost a 4-times reduction.
Also, this code assumes the number of elements is known. That number is often
determined at runtime, which would require an extra strip-mine loop to handle the
case when the number is not a multiple of 4.
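The strip-mine cleanup mentioned here can be sketched as follows (plain Python standing in for the hypothetical 4D SIMD instructions; the chunked main loop and the scalar tail loop are the point, not the syntax):

```python
# Sketch of strip mining for the DAXPY loop: process elements in SIMD-width
# chunks of 4, with a scalar cleanup loop for the leftover elements when n
# is not a multiple of 4.
def daxpy_strip_mined(a, x, y, width=4):
    n = len(x)
    main = n - (n % width)           # largest multiple of the SIMD width
    for i in range(0, main, width):  # "vector" body: 4 lanes per iteration
        for lane in range(width):
            y[i + lane] = a * x[i + lane] + y[i + lane]
    for i in range(main, n):         # scalar strip-mine cleanup loop
        y[i] = a * x[i] + y[i]
    return y

print(daxpy_strip_mined(2.0, [1.0] * 10, [1.0] * 10))  # ten 3.0s
```

A vector machine with a vector length register would absorb the odd-sized tail by simply setting VL, which is exactly the omission the reading points out.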
One intuitive way to compare…
Okay, this is an important one.
I will start from here next
Wednesday, okay?
If you have 5 minutes before class, you want to read this first, okay?
Because this is a very important concept, and we use it a lot when we discuss, say,
how much memory bandwidth we should design for a given CPU speed, okay?
And for AI applications, we draw this again and again. All right, thank you.
Bye, bye.
Okay, we're adding vector length for this to some nests, and some… Absolutely. It's
gonna say that if you haven't set it to anything, then by default, like, all these
ABC instructions on that, and then you could just say…
Do they have that advice to you?
Like, yeah, everything's a subscription.
You can buy a cheap film there.
Yeah, that's how…
Nov 19:
Kim, Eun J
So it's sometimes very hard for people, like, to get used to the tools. You don't
have to work a lot, but I mean, like…
Well, Austin pays less in the California office. I think it would be…
More, but the cost of living is more, so it balances out.
I don't know, I don't have time to, like, calculate any of that, so… I'm just gonna
see. Whenever I have told my professors that we're gonna fill… Yeah, it's with
Aurora.
They have through the way EJ talked about.
You and what? Maybe I can…
Like, yeah, that's the thing. Intel leans a little more towards my long-term goal,
but also, it is just an internship, so… I mean… In the market, so for interns, I
like, there are, like, millions of them.
But Hartford entering school are, like, very little. I think that's why they went
with me, because I've got the Hollywood background, and I might be better at…
What a rough life! I know, I'm, like, shocked.
I can't believe it. Yeah, like, it's a tough decision, but whichever decision is
still a good one. Wait, for summertime? Or is this a job? No, it's actually May to
December. May through December for the co-op.
That's really cool. The weather in Oregon's fantastic in the summer. Fantastic.
Where the food is, where the people are. Do you want to make friends with Austin?
Oh, I see there's some lobbying going on. She's awesome. I put Oregon because my
family lives in California and Washington. Like, I potentially might have…
I haven't heard of AM, but it's a tier software, so I don't know if I… but, like,
I'm not… I'm not gonna not take the…
Now, with this set of slides, we're gonna learn.
Intel, but it's in Oregon. But then, work-wise, it's kind of a tough spot.
But I don't know if…
Like, what are they… are they having to do similar things? So, the AMD one, from
what I gather, is making software for the CPU curve, too. Okay.
other one at Intel is just a firmware position, and they make firmware for power
management on… Yeah, so I'm like… and then I'm thinking, okay, like, long-term,
where do I want to be, and like…
Which one might you have? I have a family in Austin. Okay.
There's so much to do in Austin, which is really cool, too. Yeah, well… I mean, I'm
sure Portland would be great. Not as much as… not…
I mean, there's hiking, and it's gorgeous, and they've got lots of…
Good afternoon. Let's start. It's 2:50, right? So let me set the timer.
Last time, someone raised the question that the way I explained it doesn't work:
by the time you get, say, the k-th request —
if you follow my assumption on how to interpret the 12-cycle latency and the
6-cycle busy time — the bank is still busy, so it won't work.
So I looked up how to interpret the 12 and the 6,
the latency and the busy time. This is how I understand it. I posted it on
Piazza, okay?
So let's look at bank 0. The processing unit generates a request every cycle. So it
goes to bank 0, then bank 1, bank 2, bank 3, and so on. Can you see that?
I made this table with the help of ChatGPT — I'm lazy, I asked it to draw it, okay?
I explained it: can you draw a table? So it gave me a table, which
saves my time, okay? So here: every request is mapped, modulo the bank count, to a
different bank, okay? Let's look at the first line.
When a bank receives a request, the memory will be busy for the first 6 cycles,
okay?
During those cycles, it cannot accept any new requests, because it is using the
row and column strobes and activating the DRAM cells it needs to read. Those
cannot be shared with another request. So the 6 cycles are,
you know,
used by the first request, and then the bank will be free after 6 cycles, okay?
But the request requires an additional 6 cycles to finish. That's the whole
12-cycle latency.
These later cycles, I assume, use the bus, and so they can be pipelined, okay? Can
you see that?
So when bank zero gets its second request at cycle 8,
the memory is actually idle. Can you see that? Because it was done with the
earlier one at cycle 6, so…
At cycle 8 you see it idle, so the request can be served without any delay, and
then off it goes, okay? It is beautifully pipelined. So what is the total time?
Then I look at bank 7 — the last one — which will experience the longest time,
because its first request arrives
latest, right? So this is the timing: at cycle 7 it gets the request, the first
return is at cycle 19, and then every subsequent return has an 8-cycle difference,
so it goes like this. So,
75 is the last one.
Okay, so this is the way you can calculate it. Okay.
So, for 75, you can think of it like this: the very first request is issued at
cycle 0,
and after that you have 63 more,
one every cycle, so the last one is issued at cycle 63. Then it takes 12 more
cycles. That's the time you get:
75.
Is it clear?
For those who were late, you can watch the Zoom recording, okay? I don't want to
spend more time on this.
But for stride 32:
every time a request comes, there is a conflict, right? So it will be all
serialized.
So: after the first one, each of the remaining requests waits 6 cycles behind the
previous one, and then there's the 12-cycle latency. So…
That's, finally, the resolution I came up with, okay? I couldn't interpret the
numbers exactly the textbook's way, but I know that with stride 1 versus stride
32, where you have a bank conflict all the time, you will need to wait, okay?
Every memory bank has a separate memory port, so you can have concurrent memory
accesses going on in each bank. That's the idea of banking.
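Under this interpretation — a 12-cycle total latency per request, a 6-cycle busy period during which the bank cannot start another request, and one request issued per cycle (my assumptions, matching the numbers above) — a short simulation reproduces both totals:

```python
# Simulation of the quiz under the stated interpretation: one request issued
# per cycle, each request takes 12 cycles total, and a bank stays busy
# (unable to start the next request) for the first 6 of those cycles.
NUM_BANKS, LATENCY, BUSY = 8, 12, 6

def total_cycles(stride, num_requests=64):
    bank_free = [0] * NUM_BANKS              # cycle at which each bank frees up
    finish = 0
    for k in range(num_requests):
        issue = k                            # request k is issued at cycle k
        bank = (k * stride) % NUM_BANKS
        start = max(issue, bank_free[bank])  # wait if the bank is still busy
        bank_free[bank] = start + BUSY
        finish = max(finish, start + LATENCY)
    return finish

print(total_cycles(1))    # 75  (no conflicts: last issue at 63, plus 12)
print(total_cycles(32))   # 390 (all to bank 0: 63 waits of 6, plus 12)
```

With stride 1 each bank sees a new request only every 8 cycles, longer than the 6-cycle busy time, so nothing ever waits; with stride 32 every request queues behind the previous one at bank 0.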
Okay.
All right. Okay, I posted this on Piazza, so you can look it up and study it.
Let me go back to… yes.
I told you this is important, okay?
You will see roofline performance model graphs
in a lot of places — whenever we talk about data-intensive applications, which
means AI applications, right?
So let's look at that.
to compare the potential floating-point performance of variations of SIMD
architectures is the roofline model.
Let me… let's start from the figure. Okay.
One intuitive way to compare the potential floating-point performance of variations
of SIMD architectures is the roofline model.
The horizontal and diagonal lines of the graphs it produces give this simple model
its name and indicate its value.
It ties together floating-point performance, memory performance, and arithmetic
intensity in a two-dimensional graph.
The definition of arithmetic intensity is the ratio of floating-point operations
per byte of memory accessed.
It can be calculated by taking the total number of floating-point operations for a
program divided by the total number of data bytes transferred to main memory
during the program's execution.
This figure shows the relative arithmetic intensity of several example kernels.
Some kernels have an arithmetic intensity that scales with the problem size, such
as dense matrix operations,
but there are many kernels with arithmetic intensities independent of problem size.
Peak floating point performance.
So did you get the keyword, kernel? A part of a code — if it's famous, we give that
piece of code a name. So we have, like,
sparse matrix-vector multiply, SpMV, and in AI applications you see GEMM a lot.
Those are kernel names. And based on the kernel, you can see whether… we will
learn this, okay?
Arithmetic intensity is nothing but this: you brought data from memory; on that
same data, how many arithmetic operations do you do?
Okay, so more means you need higher computation power,
and less means your performance will be limited by memory bandwidth. Okay, that's
all about it.
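As a worked example (my own illustration, not from the slides): DAXPY, y[i] = a·x[i] + y[i] in double precision, does 2 floating-point operations per element but moves 24 bytes per element, so its arithmetic intensity is low:

```python
# Worked example: arithmetic intensity of DAXPY, y[i] = a*x[i] + y[i],
# in double precision (8-byte elements).
def arithmetic_intensity(flops, bytes_moved):
    return flops / bytes_moved

n = 1_000_000
flops = 2 * n            # one multiply + one add per element
bytes_moved = 3 * 8 * n  # load x[i], load y[i], store y[i]: 8 bytes each
print(arithmetic_intensity(flops, bytes_moved))  # 1/12 FLOP per byte
```

At 1/12 FLOP per byte, a kernel like this sits far on the memory-bound side of a roofline plot, which is why streaming kernels are limited by bandwidth and not by how many multipliers you have.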
So, if you look at architecture papers on GPUs or AI accelerators: a lot of the
time, yes, when we measure inference or training time, we use some model — a LLaMA
model, a transformer — the model itself.
But then,
if your architecture mainly focuses on, say, exploiting sparseness in matrix
computation, then sometimes they show the results only with the matrix-computation
kernels, mini-benchmarks.
That's also something you guys can do.
Right?
Basically, all these architectures are limited by the memory bandwidth itself,
right? There is nothing else that limits the performance, right? We will see — it
depends. Because, you know, with SIMD nowadays,
computation is so easy, isn't it? You just add more computation — like multipliers;
we can have more multipliers. Yeah, yeah. So my point is, if you theoretically
assume infinite memory bandwidth, then…
Yeah, exactly.
But is that true? It's not. Okay.
Okay, so this is the graph.
Peak floating-point performance can be found using the hardware specifications.
Many of the kernels in multimedia applications do not fit in on-chip caches,
so peak memory performance is defined by the memory system behind the caches.
One way to find the peak memory performance is to run the STREAM benchmark.
These figures show the roofline model for the NEC SX-9 vector processor on the
left side, and the Intel Core i7 multicore on the right side, with the STREAM
benchmark.
The vertical y-axis is achievable floating-point performance, from 2 to 256
GFLOPs per second.
The horizontal x-axis is arithmetic intensity, varying from 1/8 FLOP per DRAM byte
accessed to 16 FLOPs per DRAM byte accessed in both graphs.
Note that the graph is on a log-log scale.
For a given kernel, we can find a point on the x-axis based on its arithmetic
intensity. If we draw a vertical line through that point, the performance of the
kernel on that computer must lie somewhere along that line. We can plot a
horizontal line showing the peak floating-point performance of the computer.
Obviously, the actual floating-point performance can be no higher than the
horizontal line, because that is a hardware limit.
How can we find the peak memory performance?
Because the x-axis is FLOPs per byte and the y-axis is FLOPs per second, the slope
of a diagonal line is bytes per second. Thus we can plot a third line that gives
the maximum floating-point performance that the memory system of the computer can
support for a given arithmetic intensity. So we can get the attainable
GFLOPs per second as the minimum of
memory bandwidth multiplied by arithmetic intensity,
and peak floating-point performance.
The roofline sets an upper bound on the performance of a kernel depending on its
arithmetic intensity. If we think of arithmetic intensity as a pole that hits the
roof:
either it hits the flat part of the roof, which means the performance is
computationally limited, or
it hits the slanted part of the roof, which means the performance is ultimately
limited by memory bandwidth.
Okay, so… When we design hardware.
The first thing, we draw this.
Okay.
And then, for a given kernel — we will learn how to calculate arithmetic
intensity — you find the arithmetic intensity. Okay, say it's 4, okay?
Then up here is the maximum, right? The throughput the hardware supports.
And with a very low intensity, you see your performance will be limited by
memory.
Okay.
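The bound just described can be written as a one-line formula; here is a minimal sketch with hypothetical machine numbers (256 GFLOP/s peak, 16 GB/s bandwidth — illustrative values, not a real spec):

```python
# Sketch of the roofline bound: attainable GFLOP/s is the minimum of the
# flat roof (peak compute) and the slanted roof (bandwidth * intensity).
def attainable_gflops(peak_gflops, peak_bw_gbs, intensity):
    return min(peak_gflops, peak_bw_gbs * intensity)

# Hypothetical machine: 256 GFLOP/s peak, 16 GB/s of memory bandwidth.
print(attainable_gflops(256, 16, 0.5))  # 8.0 -> memory-bandwidth limited
print(attainable_gflops(256, 16, 32))   # 256 -> compute limited
```

The crossover (the "ridge point") sits where bandwidth × intensity equals peak compute — here at 16 FLOPs per byte; kernels to the left are memory-bound, kernels to the right are compute-bound.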
So — I was very impressed once by Mark Hill. He has been a professor at Wisconsin
for a long time,
a full professor there even from when I was a student. He's a very well-known
person.
For one of his papers, his student had a visa problem, so he came and presented it
himself.
Oh my goodness, he explained the concept so cleanly, with just one figure, what
they were doing. I forget the details, but I was so amazed that I know the
importance of this graph, okay? At every early design point,
especially when we have a data-intensive workload, we draw this and see — because
the roof is also hardware-dependent,
the peak, okay?
So… nowadays, a lot of my
group's work — together with Abdullah's group — is on this range of
problems.
When we have LLM inference — let's say the LLM generates a long sequence of
tokens — it becomes memory intensive, and it's very low in terms of
arithmetic intensity. So then, how do we deal with the memory
limitations? One student works on compression; one student tries to identify
the important tokens and get rid of the others, something like that. So
these problems all live here, okay, not there.
Why? With SIMD we can put in a lot of computation power.
This is where we're going.
All right, so let's move on. Now, the fun stuff: GPUs, okay? At least
try to get familiar with all the concepts and figures, the terminology, because —
I think, as a computer architect, your first job
preference goes to NVIDIA, right?
Okay, hopefully some of you got offers.
With this set of slides, let's discuss graphics processing units, GPU
architectures, one of the last variations of SIMD architectures.
People can buy a GPU chip with thousands of parallel floating-point units for a few
hundred dollars and plug it into their desk side PC.
Such affordability and convenience makes high-performance computing available to
many.
The interest in GPU computing blossomed when this potential was combined with the
programming language that made the GPUs easier to program.
Therefore, many programmers of scientific and multimedia applications today are
pondering whether to use GPUs or CPUs.
Also, for programmers interested in machine learning or Bitcoin mining, GPUs are
currently the preferred platform.
Basically, the GPU is used as a device alongside the host CPU, so we have a
heterogeneous execution model.
The challenge for the GPU programmer is not simply getting good performance on the
GPU, but also coordinating the scheduling of computation on the system processor,
the CPU, and the GPU,
and furthermore the transfer of data between system memory and GPU memory.
As we will see later, GPUs have virtually every type of parallelism
that can be captured by a programming environment: multithreading, MIMD, SIMD,
and even instruction-level parallelism.
NVIDIA decided to develop a C-like language and programming environment that would
improve the productivity of GPU programmers
by attacking both the challenges of heterogeneous computing and of multifaceted
parallelism.
It is called CUDA, for Compute Unified Device Architecture.
CUDA provides C/C++ for the system processor, the host, and a C/C++ dialect for
the GPU, the device.
A similar programming language is OpenCL, which several companies are developing
to offer a vendor-independent language for multiple platforms.
NVIDIA decided that the unifying theme of all these forms of parallelism is the
CUDA thread.
Using this lowest level of parallelism as the programming primitive, the compiler
and the hardware can gang thousands of CUDA threads together to utilize the
various styles of parallelism.
NVIDIA classifies the CUDA programming model as a single instruction, multiple
thread, SIMT.
For reasons we will see soon, these threads are blocked together and executed in a
group of threads called a thread block.
Okay, so, thread…
Let's learn the basic knowledge first. Here, a thread is associated with each data
element.
Threads are organized into blocks, and blocks are organized into a grid.
Note that the GPU hardware handles thread management and scheduling — not the
application, nor the operating system.
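The thread/block/grid hierarchy boils down to an index computation; here is a plain-Python sketch mirroring CUDA's usual convention (the function and variable names are mine, mimicking blockIdx, blockDim, and threadIdx):

```python
# Sketch mirroring CUDA's convention: each thread derives a global element
# index from its block and thread coordinates, so a grid of blocks of
# threads covers one data element per thread.
def global_thread_index(block_idx, block_dim, thread_idx):
    # CUDA equivalent: blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * block_dim + thread_idx

# A grid of 4 blocks of 256 threads covers elements 0..1023.
indices = [global_thread_index(b, 256, t) for b in range(4) for t in range(256)]
print(indices[0], indices[-1])  # 0 1023
```

Every thread runs the same code; only its (block, thread) coordinates differ, which is what makes the hardware scheduling described above possible.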
So: in a CPU, job scheduling is done by the operating system, whereas in a GPU,
the job scheduling — the unit is called a warp, W-A-R-P — the scheduler is in
hardware.
Okay, so earlier on — about
10 years before the NVIDIA boom — I saw a lot of work on warp
scheduling in our community, because it's a hardware problem.
And we have such power because —
how can I put it — we have a huge register file, and the number of registers
limits the number of concurrent threads the
processing unit can provide, okay? And we have full control over which registers
we are using. So this is a really interesting architecture.
We use NVIDIA systems as our example, as they are representative of GPU architectures today. Specifically, we follow the terminology of the preceding CUDA parallel programming language and use the NVIDIA Pascal GPU as an example.
Like vector machines, GPUs work well only with data-level parallelism.
Both styles have gather-scatter data transfers and mask registers,
and GPU processors have even more registers than vector processors do.
Sometimes GPUs implement certain features in hardware that vector processors would implement in software.
This difference is because vector processors have a scalar processor that can execute a software function. Unlike most vector machines, GPUs also rely on multithreading within a single multithreaded processor to hide memory latency.
However, efficient code for both vector architectures and GPUs requires programmers
to think in groups of operations.
Let me explain…
So, okay, so here, the takeaway, maybe…
by the time you have an interview, for example, tomorrow, you want to revisit this, okay? But let's say in 5 years, 10 years, what kind of things do you want to remember? So, the GPU descended from the vector processor, right?
So, you know, I think it will be on the final, right? Chime, convoy — you know the basic idea of how to group them and then work together, right? But
the biggest difference
of the GPU — and you can even contrast it with the CPU — is this one, okay? We use multithreading
to hide the memory latency.
So you will see, like, when we have DAXPY, the kernel program, on your final.
Those we vectorize, right? Vectorize. Then in each thread block you have, let's say, elements 0 to 31, okay? All these threads will start with a load, can you see that?
Right? Then it takes time.
If it's a hit, it's fine, but if it misses, what does it do? It blocks that thread block, and the warp scheduler picks another block to run.
Okay.
So you do a thread context switch all the time at the hardware level, very fine-grained, at the block level. Thread block level.
So, until that data is brought into the cache,
they will do other things. So let's say you have 12 thread blocks, each with 32 loads. You issue, issue, issue, and then block, block, and then after the 12th,
the first thread block will have its data in the cache, right? So when you switch back to that first one, you can go on to the next operation, the multiplication, like that. Can you see that?
Every time you have a load, you do a context switch — multithreading. That is the one thing you want to remember. So, the GPU is mainly a throughput-oriented design.
Okay, we don't care, actually, how long it takes to get a piece of data, or how long it takes to do a multiplication. Why? We have a lot of data; we can do pipelining.
Okay?
So we hide the computation latency with pipelining, and we hide the memory latency with multithreading.
Alright?
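The issue-then-switch behavior just described can be put into a toy timing model. This is only an illustration with made-up numbers (the 100-cycle miss latency and 12 warps are assumptions matching the lecture's example, not hardware figures):

```python
# Toy timing model of latency hiding via multithreading (illustrative numbers).
MISS_LATENCY = 100   # assumed cycles for a missed load to return
NUM_WARPS = 12       # assumed number of ready warps/thread blocks, as in the lecture

# Without multithreading: each warp stalls on its own miss before the next issues.
serial_cycles = NUM_WARPS * MISS_LATENCY

# With multithreading: the scheduler issues all 12 loads back to back (1 cycle
# each) instead of stalling, so the 12 misses are serviced in overlap. By the
# time the last load issues, the first miss is already almost back.
overlapped_cycles = NUM_WARPS + MISS_LATENCY

print(serial_cycles, overlapped_cycles)  # 1200 vs 112 cycles
```

Same work, roughly 10x fewer cycles: that is the throughput-oriented design point — no single load gets faster, but the machine never sits idle.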
This example, I think, will be easily understood with a picture. I don't know when…
I need to…
Remember this, okay?
Okay.
So when you revisit this, open this example figure and look at it together, okay? I don't have a way to show them together.
Let me explain GPU terminology with the example shown in Figure 4.13 in the textbook.
Here,
we have A equals B multiplied by C, okay? They each have 8192 elements, okay? So we multiply two vectors together.
The GPU code that works on the whole 8192-element multiply is called a grid.
To break it into more manageable sizes, a grid is composed of thread blocks,
each with up to 512 elements.
Note that a SIMD instruction executes 32 elements at a time. With 8192 elements in the vectors, this example has 16 thread blocks, because when you divide 8192
by 512, you get 16.
So the grid and thread block are programming abstractions implemented in GPU hardware that help programmers organize their CUDA code.
Here, a thread block is similar to a strip-mined vector loop with a vector length of 32.
So one will handle elements 0 to 31, and the second will be 32 to 63, like that. So this is strip-mining, right? Can you see that? The first elements start 32 apart, right? Strip-mining, and then you have a vector loop, okay?
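That strip-mining — one logical loop over all 8192 elements, carved into 32-element pieces — can be sketched as follows (a host-side Python illustration, not GPU code):

```python
# Strip-mining sketch: the logical loop over 8192 elements is executed 32 at
# a time; each 32-element strip corresponds to one SIMD thread (warp).
N = 8192
VL = 32   # elements handled per SIMD instruction

strips = [(start, min(start + VL, N)) for start in range(0, N, VL)]

assert len(strips) == 256            # 8192 / 32 strips in total
assert strips[0] == (0, 32)          # first strip covers elements 0..31
assert strips[1] == (32, 64)         # second covers 32..63, as in the lecture
```

Grouping 16 of these strips gives one 512-element thread block, and 16 thread blocks give the whole grid.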
A thread block is assigned to a multithreaded SIMD processor by the thread block scheduler.
The current generation of NVIDIA GPUs has 7 to 15 multithreaded SIMD processors.
This is wrong, right?
Anyone search?
So, what does the current GPU have? How many SIMD processors? 108 or something? 108, okay. Okay, much bigger, okay? It's been… you know, I got it from the textbook, but the textbook is from
a long time ago. In our computer architecture area, one year is a long time.
Okay, we only do research with this year's published papers.
Okay, if you pick one that's 2 or 3 years old, I allow it, okay?
But can you publish that work? No. Because there is more work in between, okay?
So this is the pride we have coming into the architecture community. So I was a little bit shocked when I first went to… like, the first conference I went to was ISCA — not even HPCA or MICRO, but ISCA. ISCA is the number one, okay, the oldest one.
Then what I felt was that the people there all looked so proud and arrogant.
Okay — so if I survived, you can survive.
Why? I think we are the ones leading this world.
And even in any other field —
like, you see the circuits…
I know a lot of you are working at the circuit level, or on systems, operating systems, compilers, or applications.
For example, 10 years ago in architecture we talked about smart NICs, okay? A network interface card with a CPU on it, so we can do things there. I even did that work for my PhD. How many years ago? 20.
Now, if you go to OSDI, the operating systems conference, and, you know, the compiler and circuit ones — you see papers on smart NICs now.
We are leading, though,
in this area — and, okay, that's the way it goes for the computer science field.
So, people are so arrogant, okay?
I survived, you can survive, okay? It's a good thing, because we know we try hard with our experiments. The thing you should be proud of is the simulator tool you are using, GSIM, okay? Very detailed, very detailed. We can
trust the numbers you generate. So if you present some idea using GSIM,
people will listen to you, okay? These are tools that are openly, widely accepted, and they are very detailed and capture the interaction between different pieces of the system, so…
So it's a hard field, but it's a good one. So these 7 to 15 multithreaded SIMD processors is a way old number. It's more than 100 nowadays, okay?
Each thread is limited to 64 registers.
Groups of 32 threads are combined into a SIMD thread.
Another name is WARP.
It maps to 16 physical lanes, so up to 32 warps are scheduled on a single SIMD processor. Each warp has its own PC. The thread scheduler uses a scoreboard to dispatch warps.
So, the PC value — okay, this is still von Neumann style. Can you see that? You have a PC. Okay, within a warp you have multiple threads, but they need to march together. So, do you see the importance of
what I'm talking about?
You have multiple threads, and then you share one PC for a block of threads.
Can you accommodate divergence
of control flow? Control flow means what? An if statement.
Can you have it?
No, right?
So what do we have?
Do you remember?
Masking? Mask registers, okay? We do all the computation. That is cheaper, because you want to have one PC.
Then, only later, if it's not a necessary computation, we don't save the result.
Okay? That's the idea. One PC for the warp.
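The mask-register idea — all lanes execute everything, and the mask decides which results are kept so one PC suffices — can be sketched like this (an illustration with made-up lane values, not real GPU code):

```python
# Predication sketch: every lane executes BOTH sides of the if; a mask then
# selects which result each lane keeps, so all lanes can share a single PC.
x = [5, -3, 7, -1]                  # made-up per-lane values

mask = [v > 0 for v in x]           # predicate: the "if (v > 0)" condition

then_result = [v * 2 for v in x]    # computed by every lane, needed or not
else_result = [0 for _ in x]        # also computed by every lane

# Only now does the mask decide which computation each lane saves.
result = [t if m else e for m, t, e in zip(mask, then_result, else_result)]
print(result)  # [10, 0, 14, 0]
```

The wasted work on masked-off lanes is the price paid for keeping the whole warp marching in lockstep.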
See? The thread scheduler uses a scoreboard to dispatch warps.
The scoreboard is very similar to the one in the Tomasulo algorithm. By definition, there are no data dependencies between warps.
It dispatches warps into the pipeline so we can hide memory latency. Whenever you need to load memory data, we switch between warps.
The thread block scheduler schedules blocks to particular SIMD processors.
Within each SIMD processor we have 32 SIMD lanes, so that we can have 32 concurrent operations — wide and shallow compared to a vector processor.
This is the example shown in Figure 4.13 from the textbook.
The mapping of a grid, thread blocks, and threads of SIMD instructions to a vector-vector multiply on 8192 elements is given like this.
Each thread of SIMD instructions calculates 32 elements per instruction, and in this example each thread block contains 16 threads of SIMD instructions, and the grid contains 16 thread blocks.
The hardware thread block scheduler assigns thread blocks to multithreaded SIMD processors here, and the hardware
thread scheduler picks which thread of SIMD instructions to run each clock cycle within a SIMD processor. So it's very different from the CPU, where the scheduler is the operating system. Here, we have a hardware thread scheduler.
Only SIMD threads in the same thread block can communicate via local memory.
This figure shows a simplified block diagram of a multithreaded SIMD processor.
As you can see, this is similar to a vector processor, but it has many parallel functional units instead of a few that are deeply pipelined, as in a vector processor.
It has 16 SIMD lanes; the SIMD thread scheduler has 64 independent threads of SIMD instructions that it schedules with a table of 64 program counters (PCs).
Note that each lane has a separate 1K of 32-bit registers.
Let's look at NVIDIA instruction set architecture.
I'm having some…
I'll give all the…
Is it too fast?
Can you see that?
So for each one, you have
the 16 lanes, separate. But the thing is, your register files are kind of, how can I say, interleaved, so each lane has its associated
1K 32-bit registers. So you can think of, like, 1K operations going on concurrently per 32 data elements. For each one, you have the SIMD lanes, okay? It looks like one, but there are 32 of them.
This is a simple way to see inside the GPU.
Any questions?
So, I briefly talked about coalescing
before.
Okay, so what coalescing does: let's say your instruction is a load instruction, okay?
So, because we strip-mine, let's say this one covers elements 0 to 31, then 32 onward, like that. Whenever you have a miss, you put in the miss request, right? But then, if the addresses are consecutive, what do they do? Instead of sending 16 different memory requests to the memory,
you send far fewer.
Okay, and you do coalescing how? If the addresses are consecutive, you put them together and then send only the first address and the size info — how much data you need to read, okay? That's coalescing.
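The mechanism just described — consecutive per-lane addresses collapsing into one (base address, size) request — can be sketched as a small function. This is a conceptual illustration, not how any particular memory controller is implemented:

```python
# Coalescing sketch: runs of consecutive word addresses become a single
# memory request described by a base address plus a size.
def coalesce(addresses):
    """Group sorted word addresses into (base, size) runs."""
    requests = []
    base, size = addresses[0], 1
    for addr in addresses[1:]:
        if addr == base + size:      # still consecutive: extend the run
            size += 1
        else:                        # gap: emit the run, start a new one
            requests.append((base, size))
            base, size = addr, 1
    requests.append((base, size))
    return requests

# 16 lanes touching consecutive words -> one request instead of 16.
assert coalesce(list(range(16))) == [(0, 16)]
# A gap in the access pattern splits it into two requests.
assert coalesce([0, 1, 2, 10, 11]) == [(0, 3), (10, 2)]
```

This is why the strided layout matters: only when neighboring lanes touch neighboring addresses does the whole warp's traffic collapse into a handful of requests.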
And based on this, we also, in our group, many years ago, had a packet coalescing idea here. We used
not only the address coalescing, but you can also combine the replies from the interconnection network. And earlier, when GPUs were getting started,
you know, importantly, what I found: in the CPU —
since you didn't learn it yet, okay — the communication between cores is only caused by
cache misses, okay? So there's a cache miss, and then it talks to everyone, because the word you are looking for could be anywhere. It's multi-banked,
okay?
Here,
the communication pattern is not one-to-one; a lot of it is
many-to-few or few-to-many. Multicast was very important here, okay? Because when we multicast, the data is needed not only by one
SIMD lane but by others. So, it was mainly to fit the communication patterns. So I wrote the proposal and it got funded, because they are different from the CPU, right?
So, this is a hint for your term project.
You've got the paper — then how can you get a better score, plus alpha? You first
finish the main-idea implementation as soon as possible.
Why? So that you can think — because we are human beings, we are born with critical
thinking.
It doesn't have to be. When you think more and more, you become more and more creative, but you need to have time, right? If you're busy implementing it as is and you finish one day before the deadline, what can you do? You just run the benchmarks and plot the graphs, right? You have only one day.
But if you finish 2 weeks before,
you have time to analyze, okay? This is the research part.
Okay, we will see your GitHub commit times, okay?
If you really want a plus-alpha score for the term project, I need to see the
evidence, okay? You need to finish up your implementation 2 or 3 weeks before, and then you test with different benchmarks, and then you should… you will be curious. We are human beings, right? We are born with curiosity.
Oh, what happened with that? No?
You are here because you have academic, intellectual curiosity, right? No? You are here just to get an A?
Okay, anyway, to get an A, okay? Even though you are not curious, okay? "I'm doing this, it's such a boring job, I'm not excited about it at all, but I'm doing this to get an A." Okay, this is the way you can get an A. Initially.
And then pretend you are, like, in EJ's mind.
Okay — how would EJ think about this, okay? EJ would be curious.
Okay, alright. Oh, if it's single-threaded benchmarks they use,
how does it work with multi-threaded ones? I will look at configurations. Like, let's say they have 8K memory, or an 8K last-level cache. If I double it, if I reduce it, if I change a way of associativity, how does it work? Because a lot of you chose the cache replacement
policy, right?
Not only that — you should be curious. Real, pure curiosity you should have.
I can tell, okay? I can tell whether you did those things. So your machines are free, right? Run them all the time, get some data, and then look at it. And
then, with those, maybe you can have some ideas, right?
But in order to do that, you really need to finish much earlier. I know you are…
Now, it's Thanksgiving. I'm sorry, but I don't know if you can afford the full 3
days off for Thanksgiving.
We don't have a class — someone asked. Because I checked the university schedule, we are not supposed to have a class on Wednesday, isn't it? So we don't have a class then, but we will have a class on Monday.
So… this is it.
Excellent.
Let's look at the NVIDIA instruction set architecture. The instruction set target of the NVIDIA compilers is an abstraction of the hardware instruction set.
PTX (Parallel Thread Execution) provides a stable instruction set for compilers, as well as compatibility across generations of GPUs.
The hardware instruction set is hidden from the programmer.
PTX instructions describe the operations on a single CUDA thread and usually map one-to-one with hardware instructions, but one PTX instruction can expand to many machine instructions, and vice versa.
Note that PTX uses virtual registers — an unlimited number of write-once registers — and the compiler must run a register allocation procedure to map the PTX registers to a fixed number of read-write hardware registers available on the actual device.
Although there is some similarity between x86 microarchitecture and PTX, the difference is…
One thing I want to point out: this PTX programming model provides virtual register names. The thing we learned in Tomasulo and hardware speculation — we used pointers, right?
But they use virtual registers, okay? So they use virtual registers as pointers to things, because —
remember, in Tomasulo and hardware speculation, you don't want to update the register status until, you know, it's clear, right? But here, do we have that kind of problem a lot? No. We get rid of the if statements. Everything goes, you know, marching.
So, we don't have to have a lot of pointers, but there are a lot of write-after-write
dependencies, which, you know, reduce the parallelism you can achieve. So the compiler… we have a lot of virtual
registers; we allow them to have the renaming done, okay? We use a lot of virtual registers, and then later…
the mapping between virtual registers and physical registers will be done in the compiler, taken care of.
So, if you
try to — I never did coding with CUDA, okay. They say it's much easier, okay? A lot of data dependencies and, you know, those things you don't have to worry about.
Is that the translation?
With x86, it happens at runtime,
whereas PTX translation happens in software, at load time on the GPU.
This sequence of PTX instructions is a good example: it is for one iteration of the previous DAXPY loop,
the example we handled before.
Here,
the CUDA programming model assigns one CUDA thread to each loop iteration and offers a unique identifier number to each thread block (blockIdx here) and one to each CUDA thread within a block (threadIdx).
Therefore, it creates 8192 CUDA threads, and they use the unique numbers to address each element within the array. So there is no incrementing or branching code.
The first three PTX instructions calculate the unique element byte offset in R8,
which is added to the base of the array.
Then the following PTX instructions load the two double-precision floating-point operands, multiply and add them, then store the sum.
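The index arithmetic those first PTX instructions perform can be written out in Python (a sketch of the textbook's DAXPY example; the shifts in the actual PTX are shown here as multiplies — shifting left by 9 is multiplying by 512, and by 3 is multiplying by 8):

```python
# Sketch of the byte-offset calculation at the start of the DAXPY PTX code.
BLOCK_SIZE = 512     # threads per block (shl by 9 in the PTX)
DOUBLE_BYTES = 8     # bytes per double-precision element (shl by 3)

def element_byte_offset(block_idx, thread_idx):
    i = block_idx * BLOCK_SIZE + thread_idx   # unique element index; no loop counter
    return i * DOUBLE_BYTES                   # byte offset added to the array base

assert element_byte_offset(0, 1) == 8         # second element starts 8 bytes in
assert element_byte_offset(1, 0) == 512 * 8   # first element of block 1
```

Because every thread derives its own offset from blockIdx and threadIdx, the kernel needs neither an increment nor a branch — exactly the point made above.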
Note that, unlike vector architectures, GPUs don't have separate instructions for sequential data transfers, strided data transfers, and gather-scatter data transfers. All it has is gather and scatter.
Just like vector architectures, GPUs handle if-statement conditional branches very similarly, only using more hardware. In addition to predicate registers, they use internal masks,
a branch synchronization stack, and instruction markers to manage when a branch diverges into multiple execution paths and when the paths converge.
At the PTX assembler level, the control flow of one CUDA thread is described by the PTX instructions branch, call, return, and exit, plus individual per-thread-lane predication of each instruction, specified by the programmer with per-thread-lane 1-bit predicate registers.
At the GPU hardware instruction level, control flow includes branch, jump, jump indexed, call, call indexed, return, and exit, plus special instructions that manage the branch synchronization stack.
GPU hardware provides each SIMD thread with its own stack. A stack entry contains an identifier token, a target instruction address, and a target thread-active mask.
There are GPU special instructions that push stack entries for a SIMD thread, and special instructions and instruction markers that pop a stack entry or unwind the stack to a specified entry and branch to the target instruction address with the target thread-active mask.
GPU hardware instructions also have individual per-lane predication, specified with a 1-bit predicate register for each lane.
Okay, so here.
The vector processor, the way we discussed it — there was a mask, right? It only, you know, handles one line of code.
Isn't it? But what if there are jumps, or function calls and returns? There is not only one line of code, but multiple lines.
And then there is an imbalance between the if and the else case — if the if is long and the else is short, okay? So what do they do? They execute both, okay?
They use a stack, okay? Think about how we recover whenever this control flow meets again, right? There is an if and an else, and then they meet. And then maybe there is a nested if.
Okay? So if you handle nesting, then a stack is the best way. The most recent path first, and then you meet, and then you handle the next meeting point, okay?
So that's how they handle divergence of a branch. They provide more flexibility than vector processors, okay? Because, think about it: when they released CUDA, they tried to make this GPU architecture work for general-purpose coding, okay?
But it's not a good idea to have general code, right? But with this, actually, for very general types of code — where you have a lot of calls and returns, and imbalanced if-else statements —
the divergence of control flow can be handled, okay? It can be handled.
But it will degrade the performance, okay? Because you need to go through the sync at the merging point.
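The stack-based handling of nested divergence described above can be shown with a toy model on four lanes. This is a simplification of the real branch synchronization stack (which also stores tokens and target addresses); here each entry is just the mask to restore at a reconvergence point:

```python
# Toy model of the branch synchronization stack for a nested if on 4 lanes.
# On divergence we push the current active mask; the most recently pushed
# mask is popped first, which is exactly why a stack handles nesting.
stack = []
mask = [1, 1, 1, 1]                  # all lanes active

# Outer if: suppose lanes 0 and 1 take the "then" path.
stack.append(mask)                   # push mask to restore at the outer merge
mask = [1, 1, 0, 0]

# Nested if inside the "then": suppose only lane 0 stays active.
stack.append(mask)                   # push mask to restore at the inner merge
mask = [1, 0, 0, 0]

# Inner reconvergence point: pop restores the outer-then mask.
mask = stack.pop()
assert mask == [1, 1, 0, 0]

# Outer reconvergence point: pop restores the full mask.
mask = stack.pop()
assert mask == [1, 1, 1, 1]
```

The innermost divergence always reconverges first, so last-in-first-out order is exactly what nested ifs require.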
The PTX assembler typically optimizes a simple outer-level if statement coded with PTX branch instructions to solely predicated GPU instructions, without any GPU branch instructions.
More complicated control flow often results in a mixture of predication and GPU branch instructions, with special instructions and markers that use the branch synchronization stack to push a stack entry when some lanes
branch to the target address while others fall through. NVIDIA says a branch diverges when this happens. This mixture is also used when a SIMD lane executes a synchronization marker, or converges, which pops a stack entry and branches to the stack-entry address with the stack-entry thread-active mask.
Let's look at the example code. This is very similar to the one we saw in the vector architecture example.
This if statement could compile to the following PTX instructions, as you can see here, assuming that R8 already has the scaled thread ID,
with push — with push here, let me underline here — this one, and comp,
and pop, okay,
indicating the branch synchronization markers inserted by the PTX assembler that push the old mask, complement the current mask, and pop to restore the old mask.
Normally, all instructions in if-then-else statements are executed by the SIMD processor, so you can think of all the branch paths as being executed. It is just that only some of the SIMD lanes are enabled for the then instructions, and some lanes for the else instructions.
In the common case, the individual lanes agree on the predicated branch — such as branching on a parameter value that is the same for all the lanes — so that all the active mask bits are 0s, or all are 1s.
And then the branch skips the then instructions, or the else instructions.
Let me introduce to you NVIDIA GPU memory structures you can find…
Nope.
Did you get the main idea? Look at this: there is an if and an else, and for the if statement handled here, you actually execute both of them.
Right? Then, at the end, there is a mask. There is more, like, hardware involved — not only registers, there is a stack. They
change the X value based on this condition.
So this example is more complicated than the example I showed in the vector processor discussion, because the GPU handles more general branch divergence.
Let me introduce to you the NVIDIA GPU memory structures. You can find the figure in the textbook: Figure 4.18, page 327.
Here.
I need to open it.
Each SIMD lane has its own private section of off-chip DRAM,
which is called private memory.
It is used for the stack frame, for spilling registers, and for private variables that don't fit in the registers. SIMD lanes do not share private memories. GPUs cache this private memory in the L1 and L2 caches to aid register spilling and to speed up function calls.
The next level is local memory: the on-chip memory that is local to each multithreaded SIMD processor is called local memory.
It is a small scratchpad memory with low latency and high bandwidth,
where a program can store data that needs to be reused, either by the same thread or by another thread in the same thread block.
Finally, we have GPU memory.
We call the off-chip DRAM shared by the whole GPU and all thread blocks GPU memory.
Our vector multiply example used only GPU memory.
The system processor, called the host, can read or write GPU memory. Local memory is unavailable to the host; it is private to each multithreaded SIMD processor.
Private memories are unavailable to the host as well.
So, as you can see here, there is a GPU memory, okay — it's different from CPU memory. So, earlier on, we studied a lot how to, you know, speed things up, because the GPU is a coprocessor, right?
The CPU will say, okay, this part of the code will be executed by the GPU. Then you need to send the data, and
the program itself, to GPU memory, so there is a data transfer.
And then the GPU uses it, okay? So, AMD, when they had their own way of integrating the CPU and the GPU, they, earlier, they
had some prototype of shared memory, okay? Then shared memory — think about it: when you have a shared memory, the GPU requires
tons of memory, because it's memory-intensive, whereas the CPU needs less, but it's latency-oriented, right? The time it takes to get the data is very critical for the CPU. And the CPU, actually, a lot of the time, runs the master program, and if you experience performance degradation in the CPU, that's actually
much harder to solve than the GPU delay.
And the delay you experience in the GPU — because the GPU does what?
Multithreading to hide the memory latency. So those things were very important, and still are important. So here,
you have a GPU memory, and then local memory shared by the many SIMD lanes, but each SIMD lane has its own private memory, okay? So, for shared memory and GPU memory, the different
thread blocks use the same data. A lot of the time — let's say A[i], you update the array value plus C, like that. That C is shared by all the thread blocks, right? So you need to broadcast, you need to
multicast to every SIMD lane, okay? And then you may have a question after I talk about the next chapter:
whenever you have shared memory, how about coherence? Okay. But a lot of the time, you see, in these SIMD applications,
the shared data is actually read-only —
like the kernel I just explained. We'll talk about this later, with an example.
So, as you can see… oh, okay, here.
So… you have a huge
array, but that is just simply color rendering or something the users see. And this variable is read-only, and even if it's a big data matrix, if it is read-only and shared by all these —
you know, a big array — then it can blow out the cache, okay? So, the GPU at the early design stage had a very small cache. So, early on: private memory, and then even L1, L2 —
they didn't have big caches. Because you need a cache when you see locality, right? But you don't see much locality here. So instead of having a large cache, they provide
big register files. A lot of data should be held in the registers, not the cache, okay?
That's a big difference.
Well, a SIMD processor is like a vector processor. The multiple SIMD processors in a GPU act as independent MIMD cores, just as many vector computers have multiple vector processors.
This view would consider the NVIDIA Tesla P100 a 56-core machine with hardware support for multithreading,
where each core has 64 lanes. The biggest difference is multithreading, which is fundamentally
different from a vector processor. Okay, a vector processor doesn't have
multithreading, whereas the GPU mainly works through multithreading.
Let's look at registers.
RISC-V vector registers, as in our
textbook, hold entire vectors, whereas the GPU distributes a vector across the registers of all SIMD lanes.
A RISC-V processor has 32 vector registers, each with perhaps 32 elements, or 1024 elements total. A GPU thread of SIMD instructions has up to
256 registers with 32 elements each, or 8192 elements. These extra GPU registers support multithreading.
Even this number — if you have an interview, you should update this number. This 256 is an old number.
Okay, you need to find, for the DGX, how many registers — it's humongous, okay? Because how many thread blocks you can handle at the same time is determined by the number of registers you have, okay?
All data will be loaded into the vector registers, and how many vector registers you have limits how many elements you can process at the same time, okay?
In reality, there are many more lanes in GPUs, so GPU chimes are much shorter. While a vector processor might have
2 to 8 lanes and a vector length of, say, 32, making a chime of 4 to 16 clock cycles, a multithreaded SIMD processor might have 8 to 16 lanes, so with a SIMD thread 32 elements wide, the GPU chime would be just 2 or 4.
This difference is why we use "SIMD processor" as the more descriptive term: it is closer to a SIMD design than to a traditional vector processor design.
The closest GPU term to a vectorized loop is a grid, and a PTX instruction is the closest to a vector instruction, because a SIMD thread broadcasts a PTX instruction to all SIMD lanes.
Also…
Do you recall what the chime is?
You defined a convoy: the set of instructions that can be executed at the same time. So, you know, compared to the vector architecture, this SIMD processor has a shorter chime, because you can have more convoys.
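The chime arithmetic in the passage above can be checked directly — a chime is roughly the vector length divided by the number of lanes (cycles for one vector/SIMD instruction to retire all its elements):

```python
# Chime arithmetic from the passage: chime ≈ vector length / number of lanes.
def chime(vector_length, lanes):
    return vector_length // lanes

# Vector processor: 2 to 8 lanes, vector length 32 -> chime of 16 down to 4.
assert chime(32, 2) == 16
assert chime(32, 8) == 4

# Multithreaded SIMD processor: 8 to 16 lanes, 32-wide SIMD thread -> 4 or 2.
assert chime(32, 16) == 2
```

More lanes working in parallel means each instruction occupies the unit for fewer cycles, which is why the GPU is "wide and shallow" compared to the vector machine.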
So all GPU loads are gather instructions, and all GPU stores are scatter instructions. So, basically, loads are gathers and stores are scatters.
Let's see the similarities and differences between multimedia SIMD extension computers — like the Intel series we discussed as the second variation —
and GPUs: how they're different.
Actually, at a high level, multicore computers with multimedia SIMD extensions do share a lot of similarities with GPUs, but they are different.
First, both are multiprocessors whose processors use multiple SIMD lanes, although GPUs have more processors and more lanes.
Second, both use hardware multithreading to improve processor utilization, although GPUs have hardware support for many more threads.
Third,
both have roughly a 2-to-1 performance ratio between peak performance of single-precision and double-precision floating-point arithmetic.
Fourth, both use caches, although GPUs use smaller streaming caches, and multicore computers use much larger multilevel caches that try to contain whole working sets completely.
Fifth, both use 64-bit address spaces,
although the physical main memory is much smaller in GPUs.
Lastly, unlike GPUs, multimedia SIMD instructions historically did not support gather-scatter memory accesses.
Okay, so we have 15 more minutes, so let's do a problem.
I suppose this is the easiest one. I want you to get familiar with the terminology.
It won't be… oh, thank you.
Wait a second. Alright, I'm sorry.
27. Where is 27?
Oh, didn't we do 27?
Oh, this is the easiest one. You can do it quickly. What is it?
Keep reminders.
It's been difficult.
2, 3… Okay, those 6 operations on… three elements.
So, for one load, how many operations do you do here?
How many variables… so, for this kernel, how many variables do you load, how many
do you need?
c2, a2, and b2, right? Isn't it?
How many?
So, you can just count the different variables. How many variables do you see?
Oh, there it is. Six, right? 6 variables. There are a, b, and c, and each one has a
real part and an imaginary part, right? It's a complex-number representation.
Okay, so 6 variables.
How many computations do you do?
What's your question? Mike's question was…
Do we only count the number of memory loads, or the number of memory operations?
Memory loads? No, no: loads and stores together. So you just count the different
variables.
Up here. Because, for example, for this computation, what do you do? You need to
load this, load this, load this, load this, right? And then you need to write here:
5, right?
And then this one appears here, so you don't count it.
How about this one? You don't count it. This, this, right? And then you have 5 plus
1, which is 6.
Is it a difficult question? How many different loads and stores will you have,
okay? Isn't it? Because for this one, you will have a loaded value, and then that
value will be in a register. So for the second use, you will use the register
value, not another load, right?
Question?
Okay, if I ask, can you write down RISC-V-style assembly for this code? Then it
will be clear. Can you write down the code?
What is it?
You will load, load, load, load, right? And then, say, R1 and R2 you multiply, and
then R3 and R4 you multiply, and then, say, the result is R5, the difference of
those two products, and then you
store it, right? So 5 different loads and stores. How about the next one? Do you
need to load again? The values are in R1 through R4 already, right? So you are
using registers.
But this one, you need to store back. So, 6 different memory accesses happen.
And how many operations do you do? Computations.
Good evening.
Six?
Six, right?
So, for each… You do 6.
So, single precision: the number of bytes will be 6 multiplied by 4, right? 4
bytes each, okay? So, intensity is the number of operations divided by the number
of bytes. So, it would be 6 divided by (6 multiplied by 4), so one fourth.
Can you see that?
The arithmetic intensity is low, only one-fourth.
Did you use it only one more time?
Can you see that?
You bet.
Yes? No? Okay: arithmetic intensity. In the slides I briefly talked about it, maybe
you missed it, okay? It is the number of operations
per byte.
When you get one byte, how many operations do you do, okay?
So, you just simply count the number of operations. Here, you do 6 operations,
right? Multiply, multiply, multiply, multiply, a subtract, and an add: 6. And then
divide by the total data you are getting: 6 different variables, each one 4 bytes.
So this will be one fourth, okay? That is arithmetic intensity.
It can't be more, because you just load each datum once, and then you use it only
one more time.
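The counting just described can be sketched in a few lines. This is a minimal sketch, assuming single-precision (4-byte) values as in the discussion: 4 multiplies, 1 subtract, and 1 add against 6 distinct variables.

```python
# Arithmetic intensity of one complex multiply c = a * b, counted as above:
# 4 multiplies + 1 subtract + 1 add = 6 FLOPs, and 6 distinct single-precision
# variables (a_re, a_im, b_re, b_im, c_re, c_im) = 24 bytes of data.

def complex_mul_intensity(bytes_per_word=4):
    flops = 6          # 4 mul + 1 sub + 1 add
    variables = 6      # a, b, c, each with a real and an imaginary part
    return flops / (variables * bytes_per_word)

print(complex_mul_intensity())   # 0.25, i.e. one fourth
```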
Okay. So, let's say you have matrix multiplication: what is the arithmetic
intensity here? Say you have an n-by-n matrix.
The code will be… what is it? How do I write it in the mathematical way?
C[i][j] = sum over k of A[i][k] * B[k][j], for k = 0 to n, okay?
So you have a for loop: i goes to n, j goes to n. Okay, so what is the arithmetic
intensity here? I wanted to go on to the
next quiz, but you sound like, ugh, you don't understand. This is the
simplest concept! Okay, tell me, how many words?
Okay, so how many computations do you have here? Let's say n equals…
So, let's do the simple, small one: two. Okay, two.
So, how many multiplications do you do here?
1, 2, right? How many additions do you do?
1, so 5 for each element, right?
Oh, each one.
Total 6?
Number of operations, total 6?
Okay.
The total data, right? So…
So, the number of data items is 6, and each is single precision, so 4 bytes. Okay,
so let me get the number of operations. So for each element, you multiply these
things and then add: 5 operations, right? But then it needs to repeat four
times, right? 20.
Can you see that?
The computational intensity here.
So, you don't get 20? Okay, quickly then, I really want to do the next quiz.
Matrix multiplication: you have two elements by two elements. What do you do? You
multiply: count one.
And then you multiply.
1, 2, right? And what do you do to determine this C[0][0]? You need to add these
together: 3, right? So for each C[i][j], you need 3 operations. How many elements
do you have?
4, right? So it's 12. Okay, sorry, that's why. It's 12, okay. 12 total operations.
Can I generalize this formula for the n-by-n computation?
Quickly, can anyone… I know you're smarter than me.
One matrix is n by n, and the other one is n by n, right? So we're going to get an
n-by-n matrix at the end. Is it true or not?
So for each C[i][j], you need to do this operation, isn't it? n multiplications, n
minus 1 summations. Isn't it? Very good. Yeah.
This is per element: you need n plus n minus 1 operations. And then how many
elements do you have? n squared. Then the number of variables you have:
the result is n squared,
and n squared, 3 of them, isn't it? So the intensity is n squared times (2n minus
1) operations over 3 n squared variables. This is the formula. I love this kind of
thing, okay? Okay, let me go quickly. So you can plot the graph of arithmetic
intensity as a function of n, right?
Can you see that? So here, a KV cache is all about n-by-n matrices, right?
You can make a graph, right? You can see how the arithmetic intensity changes with
the sequence length. It's a very simple thing you can do from this basic idea.
Alright, let's move on.
So, I need to take a picture. I like this. I shouldn't put it in the final exam.
Okay. This loop is unrelated. Oh, okay. Okay, so…
Oh, 28. I need to do 28.
Okay, so this… Do you mind if I take 3 more minutes?
You should try this. It's simple: just the terminology you should remember, and
then get the numbers, okay? It's just the parallelism you can get.
That is a crazy long problem. So I will underline: each SIMD instruction is 32
wide, and there are 8 lanes, okay?
It really isn't.
Not 32 per cycle; you have this number. So we can get… So: 70%, 80%, 0.85?
The issue rate is this. Okay, last bit of this.
Okay, compute throughput.
So if 80% of threads are active, 70% of instructions executed are single precision,
and so on, and we just simply follow these numbers, what throughput are you
getting?
You have 10 SIMD processors. Each processor completes 8 results per cycle, so you
have 80 concurrent executions going on.
Clock speed. What is the clock speed?
It is 1.5 gigahertz. Just use this number, okay?
But then, let's interpret
those active threads, okay? This is the number of threads you have, and only 80%
are…
Active, so you multiply by 80%, okay? 80%.
And then,
among those, 70% of all SIMD instructions executed are single precision, so the
other 30% we shouldn't count toward the
giga-operations, because this is the floating-point operation count; so 70% should
go into the calculation, okay? And then the issue rate: the throughput will be
limited by the issue rate, so multiply by
0.85, okay? These are the answers for part (a).
Now tell me how this would change in part (b).
So, first, this changes from 8 to 16: it will change by two times, right? Oh,
easy.
Okay, this changes to 16; everything else is the same, so it's doubled, right? You
should see the speedup will be double. But second,
the number of SIMD processors increases from 10 to 15. Then you will get a speedup
of 15 divided by 10, because the others are the same.
Okay, how about the third one? Adding caches effectively reduces memory latency.
So the issue rate changes too.
Everything else the same, this needs to change from 0.85 to 0.95. That's the number
you get, okay? Then take the ratio.
Ta!
So we will be back on Monday,
doing the last set of things, but then we will move to multiprocessor MIMD
architecture.
Thank you.
Oh, oh, yeah, yeah, yeah, G5.
Do they know, like, what would they get?
Like, ego cycles for you.
Yeah, okay. I just wrote down the numbers there. If these numbers match, we'll
figure out what the units are afterwards. I think he just… Okay, well, we're gonna
go figure it out.
This is, like, the longest quiz. Oh, by the way, I'm just starting.
So for… so for the second one, is that for me?
Second one, like the arrows she spits out. She has a number down there. What is
that for?
Yeah, we don't use music.
15, 16. Yeah, 10 will become 15. No.
Oh, we have to do it for each of them separately.
It's not all of the assumptions together. Okay, then you can do all the assumptions
separately. 10 will become 15, okay. Sorry!
The poor cycle doesn't actually happen to be effective.
Nov 24:
So, Saturday morning, so that's…
Hi, I was telling Tapia that we can start doing the sessions, like, to study from
our experiences. Yeah.
If we present, we may need to… I mean, that was very helpful.
I found them helpful, too, because then I could, like, remember everything really,
really well. So the last day of class is December 8th, right? And then, we have a
week until our final. It's two weeks from today.
The 30th, not this Monday but next Monday. So we have a week from Monday to turn
in our term projects. Let's begin.
Where are we? We stopped… Loop-level parallelism, right?
So this is the last one. So we will be done with the chapter… what was it? Chapter
4, and then we will go to Chapter 5, okay?
Thanks, hello!
Okay, it's okay.
This part is easy. We've already done a lot to identify data hazards, okay? It's
nothing but identifying data hazards. If there is a true data dependency, you
cannot do loop-level parallelism, okay? We try to figure out whether a loop is
parallelizable or not.
If not, renaming helps, so your quiz is about renaming, okay?
We learned that an NVIDIA GPU has tons of registers, and users use virtual
registers, which are actually renamed, okay?
So that we get rid of all the unnecessary hazards, which prevent parallelization of
a loop, okay?
With this last set of slides from Chapter 4, we're gonna discuss loop-level
parallelism, compiler technology to discover and enhance loop-level parallelism for
vectorization.
Here, we discuss compiler technology used for discovering the amount of parallelism
that we can exploit in a program, as well as the hardware support for these
compiler techniques.
We define precisely when a loop is parallel.
or vectorizable.
How a dependency can prevent a loop from being parallel, and techniques for
eliminating some types of dependencies.
The analysis of loop-level parallelism focuses on determining whether data accesses
in later iterations are dependent on data values produced in earlier iterations.
Such a dependency is called a loop-carried dependency.
In all the examples we saw in earlier chapters, there is actually no loop-carried
dependency.
So in the final, if I ask you to explain loop-carried dependency, can you write
just one or two sentences, okay? You should be brief. We have many students, so I
don't like that type of question, but in this final
I want to cover all the materials I covered, and within the time there are a lot of
things I cannot turn into, you know, real numeric problems. So I have to ask you to
briefly explain things.
So, as we go: whenever I mention something, or you think it is a key concept you
should know, write it down, and then use your own words to summarize it, okay? A
lot of times, you want to contrast
the concepts, okay?
So, is this loop parallelizable?
This code assumes that A, B, and C are distinct, non-overlapping arrays.
In practice, the arrays may sometimes be the same or overlap, but here assume
there is no overlapping. Okay, they're independent. Then let's see:
What kind of data dependency is there among the statements S1 and S2 in the loop?
You can find two different dependencies here. S1 uses a value computed by S1 in an
earlier iteration, because iteration i computes A[i+1], which is read in iteration
i plus 1.
The same is true for S2,
with B[i] and B[i+1].
S2 uses the value A[i+1], computed by S1 in the same iteration.
These two dependencies are distinct and have different effects. To see how they are
different.
Let's assume that only one of these dependencies exists at a time.
Because the dependency of statement S1 is on an earlier iteration of S1, this
dependency is loop-carried.
This dependency forces successive iterations of this loop
to execute serially, right? It is serialized between iterations.
The next dependency, S2 depending on S1, is within an iteration,
and thus it is not loop-carried.
If this were the only dependency, multiple iterations of the loop would execute in
parallel, as long as each pair of statements in an iteration were kept in order.
So we saw this kind of dependence in the example earlier, right? When we do loop
unrolling, we can extract parallelism there; we can reduce the number of stalls.
These intra-iteration dependencies are common.
It is also possible to have loop-carried dependencies
that do not prevent parallelism, okay? So, let's look at the next example.
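The two dependencies just discussed can be written out concretely. This is a minimal Python rendering of the two-statement loop from the discussion; the array contents are made-up test data for illustration:

```python
# S1 carries a dependence across iterations through A, and S2 through B, so
# the iterations must run serially. S2's use of A[i+1] produced by S1 is
# within the same iteration, and by itself would not prevent parallelism.

def run_loop(A, B, C, n):
    for i in range(n):
        A[i + 1] = A[i] + C[i]        # S1: loop-carried through A
        B[i + 1] = B[i] + A[i + 1]    # S2: loop-carried through B; intra-iteration on A
    return A, B

A = [1.0] * 4
B = [2.0] * 4
C = [1.0, 2.0, 3.0, 0.0]
A, B = run_loop(A, B, C, 3)
print(A, B)   # each A[i+1] and B[i+1] needed the previous iteration's result
```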
Okay, here, you can imagine you have 100 elements, right? 100 elements,
to run in parallel.
So you give one iteration, one thread, to each accelerator, each GPU lane.
Can 100 iterations be done in parallel?
Then you can easily see this statement prevents that, because the i-th element
needs to get a value from the previous iteration.
So say I use 3 people: i minus 1, i, i plus 1. She needs her neighbor's result,
right? Then it cannot actually run in parallel. So that's the main way you can
imagine whether you can make it parallelized.
In this example, statement S1 uses the value assigned in the previous iteration by
statement S2, so there is a loop-carried dependency between S2 and S1.
However, despite this loop-carried dependency (can you see this?)
this loop can be made parallel. Unlike the earlier loop, this dependency is not
circular: neither statement depends on itself.
And although S1 depends on S2, S2 does not depend on S1; it's not cyclic.
So, for this technique, the intuition for us human beings:
when you have this, you do loop unrolling.
You can rewrite the code without the for loop, and then it goes iteration 0, then
1, then 2, and so on, right? So if you group the statements differently:
instead of this pair, pair this one with the A[i+1] from the next iteration. How
about that?
Can you see that?
That will get rid of the
dependency between loop iterations. This is how they rewrote it. Can you see that?
So, what is the next statement, if you unroll?
B[i+1], right? B[i+1] will be there. So if you
bundle these together, B[i+1] = C[i] + D[i] with A[i+1] = A[i+1] + B[i+1], then
your reads are of the same iteration's index, right? Not the previous one.
So the loop becomes
independent.
A loop is parallel if it can be written without a cycle in the dependencies,
because the absence of a cycle means that the dependencies
give a partial ordering on the statements.
Although there are no circular dependencies in the preceding loop, it must be
transformed to conform to the partial ordering and expose the parallelism.
So it's easy, right?
The compiler does these things to make everything, you know, run in parallel on
GPUs.
Is it okay if I ask you to transform the code
to maximize parallelism, like this?
That's the
easiest kind of question:
take four GPUs, give one iteration to each in parallel, see if they can run in
parallel. If not, then what do you do? This technique: you do loop unrolling.
So when you unroll, the first statement moves out before the loop, and the last
one moves out after, right? That's what you can do.
Two observations are critical to this transformation. First, there is no dependency
from S1 to S2. If there were, there would be a cycle in the dependencies, and
the loop would not be parallel.
Because this other dependency is absent, interchanging the two statements will not
affect the execution of S2.
The second observation is that
on the first iteration of the loop, statement S2 depends on the value of B[0]
computed prior to initiating the loop.
These two observations allow us to replace the preceding loop with the following
code, like this one, okay?
The dependency between the two statements is no longer loop-carried, so the
iterations of the loop may be overlapped,
provided the statements within each iteration are kept in order.
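This transformation can be sketched and checked in a few lines. This is a minimal sketch in Python (0-based indexing; the array contents are arbitrary test data), verifying that the original loop and the transformed loop produce the same result:

```python
# Interchanging the two statements and peeling the first A-update and the last
# B-update removes the loop-carried dependence, so the new loop body can run
# with its iterations overlapped.

def original(A, B, C, D, n):
    for i in range(n):
        A[i] = A[i] + B[i]            # S1
        B[i + 1] = C[i] + D[i]        # S2: feeds S1 of the *next* iteration

def transformed(A, B, C, D, n):
    A[0] = A[0] + B[0]                # peeled first S1
    for i in range(n - 1):            # body now has no loop-carried dependence
        B[i + 1] = C[i] + D[i]
        A[i + 1] = A[i + 1] + B[i + 1]
    B[n] = C[n - 1] + D[n - 1]        # peeled last S2

import random
n = 8
C = [random.random() for _ in range(n)]
D = [random.random() for _ in range(n)]
A1 = [float(i) for i in range(n)];     B1 = [float(i) for i in range(n + 1)]
A2 = list(A1);                          B2 = list(B1)
original(A1, B1, C, D, n)
transformed(A2, B2, C, D, n)
print(A1 == A2 and B1 == B2)           # True: same results, different ordering
```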
We now need to find all loop-carried dependencies. However, this dependency
information can be inexact.
Let's look at example 4. Here, the second reference to A in this example need not
be translated to a load instruction, because we know that the value is computed
and stored by the previous statement; thus the second reference to A can simply be
a reference to the register.
Performing this optimization requires knowing that the two references are always to
the same memory location, and that there is no intervening write to the same
location.
Normally, data dependency analysis only tells us that one reference may depend on
another. A more complex analysis is required to determine that two references must
be to the exact same address.
In this example, because the two references are in the same basic block,
this simple analysis is fine.
In example 5, we see a recurrence. A recurrence occurs when a
variable is defined based on the value of that variable in an earlier iteration,
usually the one immediately preceding.
Detecting a recurrence can be important for two reasons. One:
some architectures have special support for executing recurrences. Second, in an
instruction-level-parallelism context, it may still be possible to exploit a fair
amount of parallelism.
Finding the dependencies in a program is very important.
So in the earlier example, A[i] and A[i] have the same index, so you can
immediately see these are the same, right? But what about the general form? So this
is the generalization.
These are tools to determine which loops might contain parallelism, or to eliminate
name dependencies.
The complexity of dependency analysis arises also because of the presence of arrays
and pointers in languages such as C or C++, or pass-by-reference parameter
passing in Fortran. Because scalar variable references explicitly refer to a
name, they can usually be analyzed quickly,
whereas pointers and reference parameters cause complications and
uncertainty in the analysis.
How does a compiler detect dependency in general? Nearly all dependence analysis
algorithms work on the assumption that array indices are affine.
In the simplest terms, a one-dimensional array index is affine if it can be written
in the form a*i + b, where a and b are constants and i is the loop index variable.
The index of a multidimensional array is affine if the index in each dimension is
affine.
However, sparse array accesses, which typically take the form of an index of an
index, like the gather-scatter example you saw (let me write it down here), are a
sparse access,
and one of the major examples of a non-affine access.
Determining whether there is a dependency between two references to the same array
in a loop is equivalent to determining whether two affine functions can have the
identical value for different indices
between the bounds of the loop. For example, suppose we have stored to an array
element with index value a*i + b, and loaded from the same array with index value
c*i + d,
where i is the for-loop index variable,
which runs from m to n. Then a dependency exists if two conditions
hold. First, there are two iteration indices, j and k, that are
both within the limits of the for loop.
Second, the loop stores into an array element indexed by a*j + b, and later fetches
from that same array element when it is indexed by c*k + d. That is, a*j + b
equals c*k + d.
In general, we cannot determine whether a dependence exists at compile time.
For example, the values of a, b, c, and d may not be known at compile time.
In other cases, dependency tests may be very expensive, and thus undesirable, at
compile time.
However, many programs
contain primarily simple indices, where a, b, c, and d are all constants. For these
cases, it is possible to devise a reasonable compile-time test for dependency.
As an example, a simple, sufficient test for the absence of a dependency is the
greatest common divisor,
or GCD, test.
It is based on the observation that if a loop-carried dependency exists, then
GCD(c, a) must divide (d minus b).
Let's look at an example where we can use it.
So, did you get it?
You have two affine functions here.
Then, when you have this loop, you list out all the index values.
What are we trying to catch?
If,
say, an earlier store wrote index 20, and later a load here reads index 20, then
there is a dependence, isn't it? Right? If it writes somewhere that is never read
by the other side, it's independent, isn't it?
That's what we are doing. So, you have a and c here, and the greatest common
divisor is 2, right?
Then, if (d minus b), that is,
0 minus 3, or minus 3, were divisible by 2, the index values would overlap
somewhere in the run.
Okay, but here they won't.
Okay, so what does that mean? If we have 100
processing elements, each iteration can be executed
in parallel. In one action, we can do all 100 operations at the same time, okay?
This is all about
parallelization.
We use the greatest-common-divisor (GCD) test to determine whether dependencies
exist or not. Given the values a = 2, b = 3, c = 2, and d = 0:
GCD(a, c) = 2, and d minus b = minus 3. So minus 3 is not
divisible by 2, and there is no dependency. In other words, you don't see any
inter-iteration dependency. Each reference in each iteration is independent
of other iterations, so we
can do loop-level parallelism.
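The test just worked through fits in one line of code. This is a minimal sketch: a store indexed by a*i + b and a later load indexed by c*i + d may conflict only if gcd(a, c) divides (d - b).

```python
import math

def gcd_test_may_depend(a, b, c, d):
    # True means a loop-carried dependence is *possible*; False proves absence.
    return (d - b) % math.gcd(a, c) == 0

# The class example: a = 2, b = 3, c = 2, d = 0. gcd(2, 2) = 2 does not
# divide -3, so the loop is parallelizable.
print(gcd_test_may_depend(2, 3, 2, 0))   # False: no dependence
print(gcd_test_may_depend(2, 3, 2, 1))   # True: 2 divides (1 - 3) = -2
```

Note the test is only sufficient for absence: a True result does not prove a dependence exists, since the matching indices might fall outside the loop bounds.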
In general, determining whether a dependency actually exists is NP-complete.
However, approaches using a hierarchy of exact tests, increasing in generality and
cost, have been shown to be both accurate and efficient. Note that, although the
general case is NP-complete,
there exist exact tests for restricted situations that are much cheaper.
In addition to detecting the presence of a dependency, a compiler wants to classify
the type of dependency.
This classification allows a compiler to recognize name dependencies and
eliminate them at compile time by renaming.
Okay, here, this is the example. I will put this question in the quiz so that you
can go over it. In the quiz, what you need to do is identify all the different
dependencies.
Let me remind you…
Here's what you do. Because you learned this, right?
You know the data hazards; we learned three kinds. What are they?
Read after write, write after write, and write after read, right? So, in Tomasulo's
algorithm and hardware speculation, what did we do?
What's the key idea there?
Among the three hazards, we treat them differently. What happened?
We eliminate the write after write and the write after read.
Two hazards. We eliminate them. How?
Renaming. How is renaming done in Tomasulo's algorithm and hardware speculation?
The reservation station name is used as a pointer. So the pointer provides the
renaming, right?
Instead of one variable, you have a pointer to the same object, isn't it?
We didn't talk about the first one: read after write. What happened?
Read after write, you cannot do anything to change the order, isn't it?
So the algorithm makes sure you wait until
the read after write has resolved. How? The write is done; then you can immediately
use that value, isn't it? That's the key idea of Tomasulo's algorithm, isn't it?
Wherever you have a read after write, you wait. You have a reservation station, and
you wait there, right?
The reservation station is used as a pointer, so the write after write and write
after read have been
gotten rid of, right? So here:
Can you identify all the hazards?
It could be on the final, right? Because it overlaps with the earlier concepts you
learned. This is very important. Do it now.
It's quiz 29. You guys asked for it.
You guys can do it; this is not a difficult one. You already learned it during,
thank you,
the first half of the semester.
Anti-dependency is write after read, and regular (true) dependency is read after
write.
If you just watch, you won't get it. Do it.
Make your hands busy.
You did? Okay, can I see your eyes? Thank you, Anna.
A day, today, yeah. So good, good.
So, all of you done? How many hazards did you identify?
1, 2, 3, 4, 5. I found 5.
You didn't… I didn't change the deadline, but you there didn't do it, right? You
said I should definitely change the deadline for quiz 29? I changed it…
But I changed it today, didn't I? What's that? Okay, so how many of you submitted
it?
Thank you. So, how many hazards did you find? Five. Okay. Then…
Okay, now let's do renaming, okay?
So, the main thing: any read after write, you cannot rename, right?
Because there is a value transfer.
So, rename only write after write and write after read, but make sure any
read-after-write hazards still use the same name, okay?
From now on, anything we are doing is a potential final question.
I'm lazy, I'm reused, successful.
Are they made money on it.
So you have this. What is this? It's write after write,
and write after read. So this can be renamed, okay?
So, for the solution, I will keep X, Y, and Z,
and then I will use P and Q
as the new names, okay? I will use these variables. So this first write will be
changed to P[i].
Does X[i] have any read-after-write hazards later on?
No, so you can just leave those reads, okay? Very good. And then this one…
C is okay, right? And how about this one? This Y has a
write after write, so again, this should be changed. And for this, I will use Q[i].
Anything left?
S1, S3… there was,
what, a read after write, right?
Okay, but we didn't rename that value at all, so you can keep it as it is.
Because you should keep it as is. So, read-after-write hazards, you shouldn't
rename. Only rename when you have write after write or
write after read.
Okay
Do we have to worry about the same array, like, in this case, X, having the
correct value when the loop is finished, if we rename X? Oh, yeah, yeah, very good,
very good point. So, at the end of the loop, there are side effects of renaming,
right? So you need to take care of that afterwards. So it's…
like here, you have 100 repetitions. However, that cleanup code, we call it
cleanup, is just a couple of statements.
So, with that overhead, you still make it parallel. There is
an overhead for cleanup.
So, what does that cleanup look like in this case? If we're now writing to P
instead of X, do we have to copy the entirety of P into X?
Okay, so in a GPU, what did I say? We use virtual registers, right?
So this is a virtual register. At the end, physically, it will be the same as this
one.
Okay, it'll be taken care of.
Very good point.
Okay? The side effect will be taken care of. That's why we use virtual
registers.
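Since the quiz code itself isn't reproduced in the transcript, here is a hypothetical loop body (the names X, Y, Z, P, and the constant c are all illustrative, not the actual quiz) showing the renaming rule from the discussion: rename only WAR/WAW targets, keep every RAW producer/consumer pair on the same name, and fold the renamed array back as cleanup.

```python
# Hypothetical example. One iteration originally looks like:
#   S1: Y[i] = X[i] / c      S3: Z[i] = Y[i] + c   (RAW on Y: keep the name)
#   S2: X[i] = X[i] + c      (WAR on X against S1's read)
#   S4: Y[i] = c - Y[i]      (WAW on Y against S1, WAR against S3)

c = 2.0
X = [1.0, 2.0, 3.0]
Z = [0.0] * 3
P = [0.0] * 3                 # fresh name for the S4 write to Y

for i in range(3):
    t = X[i] / c              # S1 (t holds the S1 value of Y[i])
    X[i] = X[i] + c           # S2: safe once S1's read has its own copy
    Z[i] = t + c              # S3: still consumes S1's value (true dependence)
    P[i] = c - t              # S4: renamed Y -> P removes the WAW/WAR hazards

Y = P                         # cleanup: fold the renamed array back into Y
print(X, Z, Y)
```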
Okay, so let me go… let me clean up.
…what the different dependencies are. We studied true data dependencies to figure
out how much parallelism we have, and some of the most important are
anti-dependencies.
Huh.
One of the most important forms of dependent computation is the recurrence, as we
discussed before. A dot product is a perfect example of a recurrence, as you can
see here.
This loop is not parallel, because it has a loop-carried dependency on the variable
sum.
However, we can transform it into a pair of loops, one of which is completely
parallel and the other partly parallel.
The first loop executes the completely parallel portion of the loop, like this.
Note that sum has been expanded from a scalar into a vector quantity, okay? We call
it scalar expansion. This transformation makes the new loop
completely parallel. When we are done, we need to do a reduction step, which sums
the elements of the vector, like this.
Let me use the pointer, okay, so here, like this one: we need to sum up everything.
Although this loop is not
parallel, it has a very specific structure we call a reduction. Reductions are
common in linear algebra, and they are also a key primitive in MapReduce.
Reductions are sometimes handled by special hardware in vector and SIMD
architectures that allows the reduction step to be done much faster than it could
be done in scalar mode.
This works by implementing a technique similar to what can be done in a
multiprocessor environment. While the general transformation works with any number
of processors, let's say here we have 10 processors.
As you can see here, in the first step of reducing the sum, each processor executes
this code.
Okay, so with p processors, p ranges from 0 to 9, and this loop sums up 1,000
elements on each of the 10 processors. It's completely
parallel.
Then a simple scalar loop completes the summation of the last 10 partial sums.
So, similar approaches are used in vector
and SIMD processors.
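The two-step structure just described can be sketched directly. This is a minimal sketch of scalar expansion plus reduction for a dot product; the processor count and data sizes follow the numbers used in class, and the "processors" here are just sequential loop iterations standing in for parallel units.

```python
P = 10                          # number of processors, as in the example
N = 10_000
x = [1.0] * N
y = [2.0] * N

# Step 1: completely parallel. The scalar accumulator "sum" is expanded into a
# vector of P partial sums; each processor p sums its own N/P slice.
partial = [0.0] * P
for p in range(P):              # each iteration could run on its own processor
    for i in range(p * (N // P), (p + 1) * (N // P)):
        partial[p] += x[i] * y[i]

# Step 2: a short scalar (or tree-structured) reduction over the P partials.
total = 0.0
for p in range(P):
    total += partial[p]

print(total)                    # 20000.0 for this data
```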
So reduction: actually, the importance of reduction is really huge in
machine-learning applications.
Because machine learning is nothing but, what is it, the sigma of a weight vector
and an input vector, or the weights times an earlier activation value, right? It's
nothing but sums, so there are a lot of reductions going on.
And then in distributed systems, like with GPUs, now the data size is so huge that
we cannot use one GPU; we use multiple GPUs, and we call it data parallelism:
you divide the data, and then, again, each GPU's gradients are only partial,
right? You need to do a reduction. So this is very important. So, in our
research group, we did, like,
some of the earlier work.
Actually, IBM Blue Gene, you know,
the parallel machine, had a separate reduction network. So when they have
this kind of operation, they do the computation on the way: instead of sending all
the values to be summed, they sum them along the way as they send, okay?
So these kinds of specialized techniques are very widely
used in machine learning nowadays; even in inference, they do these kinds of
things a lot. And then, look at that:
over one… One…
10,000 data, you do summation, right? But here, if you have P processor, you can
make P concurrent operations. Then there is a stride going on, right? You can do p-
value, and then next to 1,000, and then 2,000, like that. You can make it…
Parallel.
So these, we don't do in detail, but I believe the Vivex course will talk about
this parallel algorithm.
When we have a parallel machine, you need to come up with a parallel algorithm.
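The strided reduction mentioned above can be sketched like this (a toy model; each pass combines pairs of values that are `stride` apart, which is what the hardware would do concurrently, and `tree_reduce` is a made-up name):

```python
# Toy sketch of a strided (tree) reduction: each pass adds together pairs of
# values that are `stride` apart; all additions within one pass are
# independent, so a parallel machine can do them concurrently.

def tree_reduce(vals):
    vals = list(vals)
    n = len(vals)
    stride = 1
    while stride < n:
        for i in range(0, n, 2 * stride):
            if i + stride < n:
                vals[i] += vals[i + stride]   # concurrent in hardware
        stride *= 2
    return vals[0]

print(tree_reduce(range(10_000)))  # 49995000, in about log2(10,000) parallel steps
```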
Audio shared by Kim, Eun J
Audio shared by Kim, Eun J
[Link]
While data-level parallelism is the easiest form of parallelism after instruction-
level parallelism from the programmer's perspective, it still has many fallacies
and pitfalls.
First, the fallacy: GPUs suffer from being coprocessors. Well, although the split
between main memory and GPU memory has disadvantages, there are also advantages to
the GPU being at a distance.
GPUs have the flexibility to change their ISA. For example, PTX exists in part
because of the I/O-device nature of GPUs. This level of indirection between
compiler and hardware gives GPU architects much more flexibility than system
processor architects.
Second, a pitfall: concentrating on peak performance in vector architectures and
ignoring start-up overhead. Well, early vector processors, such as the TI ASC, had
long start-up times. For some vector problems, vectors had to be longer than 100
for the vector code to be faster than the scalar code.
Kim, Eun J
Kim, Eun J
[Link]
Well, now we don't need to worry about that: the machine learning applications we
use GPUs for have huge matrix sizes, right? So we don't worry about the start-up
overhead like in the early days; quickly, we can fully pipeline the system.
Audio shared by Kim, Eun J
Audio shared by Kim, Eun J
[Link]
Third, a pitfall: increasing vector performance without comparable increases in
scalar performance. This imbalance was a problem on many early vector processors;
many of the early vector processors had relatively slow scalar units.
But good scalar performance keeps down the overhead costs, such as strip mining,
and it reduces the impact of Amdahl's Law. So despite the vector processor's high
peak performance, its low scalar performance makes it slower than a faster scalar
processor.
Kim, Eun J
Kim, Eun J
[Link]
NVIDIA did well.
Audio shared by Kim, Eun J
Audio shared by Kim, Eun J
[Link]
By the harmonic mean.
Fourth, fallacy: you can get good vector performance without providing memory
bandwidth. As we discussed earlier with the roofline model, memory bandwidth is
quite important to all SIMD architectures.
Kim, Eun J
Kim, Eun J
[Link]
So nowadays, that's why HBM, right? HBM improved memory bandwidth, so it is so
successful. So AMD and NVIDIA, where they do SIMD processor development, want to
work with SK Hynix, right? And then recently, the NVIDIA CEO, Jensen, talked with
Samsung, and, yeah, Samsung has a huge hope they're going to work with NVIDIA for
the next generation to provide HBM. Let's see how it goes.
So memory is very important.
Audio shared by Kim, Eun J
Audio shared by Kim, Eun J
[Link]
Fallacy. On GPUs, just add more threads if you don't have enough memory
performance.
GPUs use many CUDA threads to hide the latency to main memory. If memory accesses
are scattered or not correlated among CUDA threads, the memory system will get
progressively slower in responding to each individual request. Eventually, even
many threads will not cover the latency.
So for the more-CUDA-threads strategy to work, not only do you need lots of CUDA
threads, but the CUDA threads themselves must be well behaved in terms of locality
of memory access.
Kim, Eun J
Kim, Eun J
[Link]
Because what? How does the NVIDIA GPU hide the memory latency? Through what?
Context switching, right? Scheduling. So if you have a lot of threads, whenever
you have a load that takes time, you swap.
Right? You context-switch to another thread block, and another thread block. So in
the meantime, the first initially loaded vector data comes in; then you can switch
to computation, and it will be beautifully pipelined.
That's the main idea of GPU.
Done! So, let's work on multiprocessors.
It used to be a really hot topic, but now we talked about the GPU first, and all
the enthusiasm is gone after this, right? But still, in any device, we still have
a von Neumann CPU, so we need to learn CPU architecture, too. So let's go back.
Do we have any questions on Chapter 4? We are ready to move on to Chapter 5.
Let me, briefly talk about the rest of the semester schedule. So, we won't have a
class this Wednesday, because we were told.
And then, when is your reading day before final?
The 8th? Okay.
So, 8th is a reading day.
Last day of class. So, I will see if I have to teach on 5th or not, on 3rd, okay?
Because I will save some time to give you a review at that time.
So, if I can cover all today and two more classes, we won't have a class on 5th,
but if not, then I will have a class on 5th, okay?
And then we have only one more chapter, and then when is your final exam? 15th. Oh,
okay, so…
I will see if I can have a special office hour sometime…
But your final already started from 10th, right?
11th, so maybe 10th is a tentative day for the office hour.
So, but the thing is, your exam is too far away, you won't study, right?
If you bring some questions, I will cover them, okay? So this is my plan.
Any questions?
All right.
So… Yeah, finally, we are in the last chapter, chapter 5, thread-level parallelism,
MIMD, okay?
Let's look at the overview.
I don't want that.
From now on, we will discuss Chapter 5, Thread-Level Parallelism. Actually, it's
all about multiprocessors.
It's the 28th.
Well, we don't have uniprocessor architecture anymore. That is the fact of 2021.
21.
We discussed that uniprocessor architecture performance is near its end, and we
need to come up with multiprocessors; actually, that was a little bit premature at
the time, around 2000 until the early 2000s. Now, what happened: the slowdown in
uniprocessor performance arising from, you know, the diminishing returns in
exploiting instruction-level parallelism, because the branch and, you know, deep
pipelining penalty is high, along with growing concern over power. Power is the
main driving force. We have to come up with a different architecture. So, it's a
new era in computer architecture.
So, multiprocessors play a major role from the low end to the high end. Every
architecture now is a multiprocessor. So, in this chapter, we will talk about
mainstream multiprocessor architecture.
Here, we focus on exploiting thread-level parallelism, TLP. TLP implies the
existence of multiple program counters, and is exploited primarily through MIMD,
Multiple Instruction Multiple Data Model.
Although MIMDs have been around for many, many decades, the movement of thread-
level parallelism across the whole range of computing, from embedded applications
to high-end servers, is relatively new; it happened recently.
Another thing: our focus here is the multiprocessor, which we define as computers
consisting of tightly coupled processors whose coordination and usage are
typically controlled by a single operating system, and that share memory through a
shared address space.
Okay, so we are having shared memory multiprocessor system.
To take advantage of an MIMD multiprocessor with N processors, we must usually
have at least N threads, or processes. For the definition of a process, try to
recall: a program at runtime. The actual, currently alive program is called a
process.
So, multi-threading is presented in most multi-core chips today.
Usually, the number of threads in multithreading is 2 or 4 times higher than the
number of cores you have.
Independent threads within a single process are typically identified by the
programmer or created by the operating system.
At the other extreme, a thread may consist of a few tens of iterations of a loop,
generated by a parallelizing compiler exploiting data parallelism in the loop. The
amount of computation assigned to a thread, called the grain size, okay, grain
size, is important in considering how to exploit thread-level parallelism
efficiently.
The important qualitative distinction from instruction-level parallelism is that
thread-level parallelism is identified at a high level by the software system or
the programmer, and that the threads consist of hundreds to millions of
instructions that may be executed in parallel.
Shared-memory multiprocessors fall into two different classes: the first, SMP; the
other, DSM.
The first group, which we call symmetric, or in other words shared-memory,
multiprocessors, SMPs, or centralized shared-memory multiprocessors, each has a…
So if I ask what SMP is in your final… if you put symmetric multiprocessor,
correct; also another name, shared-memory multiprocessor, it's, you know, the same
thing, okay?
I prefer shared memory; it explains the architecture better.
Symmetric means that, in terms of memory latency, it is symmetric for every
processor.
small to moderate number of cores, typically 32 or fewer. For multiprocessors with
such small processor counts, it is possible for the processors to share a single
centralized memory system that all processors have equal access to, so we call it
symmetric.
In multi-core chips, the memory is often shared in a centralized fashion among the
cores. Most existing multi-cores are SMPs, so all the chip multiprocessors you
have seen so far belong to SMP.
Classically, there is another category, okay? It's DSM. Compared to DSM,
Distributed Shared Memory, SMP is sometimes called the uniform memory access, UMA,
architecture, whereas DSM is called the NUMA architecture, for non-uniform memory
access time, okay? So, let me explain the second category, DSM.
To support a larger number of processors, memory must be distributed among the
processors rather than centralized.
If not, the memory system would not be able to support the bandwidth demands of a
large number of processors without incurring excessively long access latency.
So, as you can see in the bottom figure, each core has its own local memory, okay?
So there is a notion of my memory and other memory.
So to access other memories, you need to go through the interconnection network.
Usually this interconnect is a more general network, so processors connect via
direct or indirect, switched, possibly multi-hop networks. I won't go into detail
on the classification of interconnects, although my main research area is
interconnection networks, okay?
It will be a general network. They are classified as direct or indirect, depending
on how the cores connect to the routers, okay?
So you can think of it this way: an SMP has a relatively small number of cores,
usually connected through a bus, whereas DSM…
Usually, usually, okay, it's not the universal answer, but usually when we talk
about SMP, like, up to 8 cores, 16 cores, we use a bus.
So, the next quiz… no, not right after this; it will be in the final exam, okay?
You really need to understand the cache coherence protocol on the bus-based
systems, because these are the real protocols all the systems nowadays use, okay?
DSM is a large number of processors interconnected with a general network, a mesh,
or a torus, like that.
Okay, so the next quiz is the simplest one.
Just to give an idea of parallelization.
Where is this?
So you can have a… like Amdahl's Law. Recall Amdahl's Law: your program has two
parts, a sequential part and a parallelizable part, okay?
And then, you get 80 times speedup on 100 processors, okay? So, how much of the
program can be serial?
Full parallelization means you could have 100 times speedup, isn't it?
But your final overall speedup is 80, okay? It's Amdahl's Law.
Quickly do that.
I didn't… It's wonderful.
Anthony's listening.
I didn't understand that.
Today.
Thank you.
Oh my god.
Okay, so linear equations, right? You have two linear equations.
Your total execution… original execution time, let's say, is 100.
Okay?
It changes to S plus P divided by 100, because you have 100 processors, right?
Any questions so far?
What's the speed up?
100… actually, this is 100.
Divided by this, right?
You can put this in. Do I need to solve it?
You can do it, right?
So, 100 minus P. What is P?
Did you find P?
Okay, yes. It asks for S, right? So let's not do that. So, S… leave, let me
delete… keep S as it is.
So this changes to 100, and instead of P, that would be 100 minus S, right?
So, what is your S here?
You just solve this.
Quickly.
Any number you got?
What?
move on.
Yes, that's VRAM.
Anyway, this… you solved it, okay?
Shouldn't be that difficult.
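For reference, the Amdahl's Law quiz above can be worked out numerically (a sketch; `serial_part` is a made-up helper name, and the time units are the same 100 units as in the example):

```python
# Amdahl's Law quiz worked out: original time 100 units, 100 processors,
# observed speedup 80. From  100 / (S + (100 - S)/100) = 80,  solve for the
# serial part S.

def serial_part(total=100.0, procs=100, speedup=80.0):
    new_time = total / speedup            # 100 / 80 = 1.25 time units
    # new_time = S + (total - S) / procs  ->  solve for S:
    return (new_time - total / procs) / (1.0 - 1.0 / procs)

S = serial_part()
print(round(S, 4))  # 0.2525 -> only about 0.25% of the program may be serial
```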
How about the next one?
Suppose you have a 32-CPU multiprocessor system, 2 GHz, 200 ns… okay, so you have
remote memory accesses, and all local accesses hit in the memory hierarchy, and
the base CPI is 0.5.
So, a remote access actually translates to 400 clock cycles, okay? Because at
2 GHz your clock cycle is half a nanosecond, 200 ns will be 400 cycles. What is
the performance impact if 0.2% of instructions involve a remote access?
Which means 99.8% are local.
Only 0.2% go remote.
So, the new CPI. You already forgot the CPI formula?
Because only the percentages are given, right? Apart from those, what is the
baseline CPI? What is the baseline CPI?
0.5, right?
99.8% costs just this, okay? Then the 0.2% are additional, right? In addition to
this, for the 0.2% you have… what?
400 cycles.
Correct?
Only that fraction is remote; for it you have additional time. If it is local,
it's just the base half cycle per instruction.
Okay, so the new CPI is… can you give me the number, quickly?
It's 0.5 plus 0.8, so 1.3, right?
Then,
So, it asks for the performance impact.
So, after all, you need to calculate speedup, right? So, what is a speedup?
So, is it getting faster or slower?
Slower, how slower?
The original was… because it doesn't say anything else, only the CPI changes,
right? So this is the new CPI, and then the original.
Right?
So, this is a 2.6.
2.6 times slower, isn't it?
So if you want to use the speedup statement as it is, maybe you want to do this.
So it'll be the reciprocal of this… times slower, right?
Or times faster.
Yeah, the speedup is this, much smaller than 1, so it becomes slower.
Okay.
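The arithmetic of this quiz, written out (a sketch using only the numbers from the example):

```python
# The remote-access quiz worked out: base CPI 0.5, 0.2% of instructions make
# a remote access costing 400 extra cycles each.

base_cpi = 0.5
remote_frac = 0.002       # 0.2% of instructions
remote_penalty = 400      # extra cycles per remote access (200 ns at 2 GHz)

new_cpi = base_cpi + remote_frac * remote_penalty  # 0.5 + 0.8 = 1.3
slowdown = new_cpi / base_cpi                      # 2.6x slower
speedup = base_cpi / new_cpi                       # ~0.38, i.e. less than 1

print(new_cpi, slowdown)  # 1.3 2.6
```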
Can I move on? Any questions?
Like any other numeric problem, if you solve it with just your eyes, you won't get
it, okay? Use your hands.
Let me move on, if you don't have any questions.
Totally.
So what it shows here… do you remember SMP? SMP versus…
what was it? DSM. Do you remember? Distributed Shared Memory. So, you have your
CPU and then your local memory, you have a CPU and local memory, and then you have
an interconnection, and not just one crossbar; there is a network.
And then you have many of them. So oftentimes it's multi-hop, so having remote
data access is a killer for your performance.
Okay.
So, if you have a very good design of the interconnection, you can improve
performance. So, one architecture I want to show: you can look at SMP, the earlier
one; it used to look like this. You have CPUs, and then, this is a cache, okay,
their caches, and then those are interconnected through a bus.
Okay?
This is the baseline architecture we will handle from now on. And then, a lot of
the time, you have a last-level cache.
And then off-chip memory, okay? So you have only one memory through this bus.
But if you look at Xeon Phi, like the Intel servers, like 16 cores or 32, in case
you have an interview with Intel or AMD, okay, the AMD 64-core, what do they look
like? They look like this. So, when you have a large number of cores, no more bus.
Because it's not scalable. Why? A bus serves one at a time. So if all four want to
use the bus, you need to serialize.
Whereas if you have a general network, oftentimes it's regular, so you can have a
mesh or a torus. But when we talk about a CMP, a chip multiprocessor, a
multiprocessor on a chip, you put multiple cores on a chip; by nature, it's
planar, right? It's hard to have a higher-dimension topology. So a mesh is 2D,
whereas for a torus there is one more link, which breaks the planar property,
okay? So, a lot of the time, Intel and AMD, the moderate-scale server design is
this way.
However, what do they have? They have a CPU and then a small private cache.
A private cache, okay? Level 1. And then the last-level cache is banked, this way.
What do you mean? Do you understand what a banked memory is?
If there are 16 banks, what is it?
In the middle of the address, how many bits should you use to identify which bank
it should go to? 4. So, 4 bits will identify the one place to go.
Can you see that? Okay, when I load data, I look at my own private cache, level
one. If it misses, then I need to find where it is. And then: oh, this address
says it's here. So I need to send this miss request to that place and try to get
it, okay? If it also misses there, what? You need to go to memory, okay, off-chip.
So this is a typical Intel or AMD moderate server design with a large number of
cores, okay?
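The bank-selection idea above can be sketched in a couple of lines (the 64-byte block size, so 6 offset bits, and the name `home_bank` are assumptions for illustration):

```python
# Toy sketch of picking the home bank in a banked last-level cache: with 16
# banks, 4 address bits select the bank. Here we assume the bits just above
# the block offset are used.

NUM_BANKS = 16            # log2(16) = 4 bank-select bits
BLOCK_OFFSET_BITS = 6     # assumed 64-byte cache blocks

def home_bank(addr):
    # Strip the block offset, then keep the low 4 bits as the bank number.
    return (addr >> BLOCK_OFFSET_BITS) & (NUM_BANKS - 1)

print(home_bank(0x1A40))  # 9: a miss for this address is sent to bank 9
```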
Just to prepare you for an interview. This is the more practical design.
But if we do, like, a personal device, a cell phone, it's bus-based. So, let's
learn the bus-based cache coherence protocol first, and then we can apply it.
So, the cache coherence protocol.
With this set of slides, we first talk about cache coherence problem happening in
shared memory architecture.
Remember, in an earlier lecture, we discussed two different architectures to
achieve thread-level parallelism with multiprocessors: one, the symmetric
shared-memory architecture; the other, DSM. This cache coherence problem we
discuss in the symmetric shared-memory architecture.
Symmetric shared-memory architectures have a shared bus to connect multiple
processors within a single chip. So, assume that in a single chip there are
multiple processors interconnected through a common bus.
Then, in those systems, caches hold either private data, used by only one
processor, or shared data, shared by multiple processors.
The problem here: when we have shared data, we need to provide a coherent view.
Okay, what is the cache coherence problem, then? Let's look at the example.
So, let's see here: we have 3 different processors sharing one memory, okay? They
have their own…
This is a very typical example an interviewer will ask you about. Okay.
Because nowadays, everything is a multiprocessor, right?
The coherence protocol… the coherence problem is a key problem of multiprocessor
design.
So, try to understand this slide, okay?
Don't waste your time.
own caches, okay.
Okay, so then… first, processor 1 reads the U value; it reads it from memory,
because it's a cold miss.
And then processor 3 reads the U value; again, it's from memory, U equals 5.
Then what happens in processor 3? It updates the U value to 7.
Let's assume here we have a write-back policy. Do you recall, when we talked about
cache architecture, there are two different policies we can have for writing?
One, write-through: when you write, you update the cached data, and you also
update the memory data. That's write-through.
In this example, we assume write-back. In write-back, what happens? You only
update the cached data. When the block is replaced, you have a dirty bit
indicating it has been updated since it was brought into the cache.
Then you write it back to memory, okay? So, when this third step happens, U has
been updated by processor 3, which only updates its cached data.
Then, the next time P1 tries to read U again, it will have a cache hit, right?
Now, what does it read here? U equals 5, not 7.
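The stale-read sequence above can be replayed as a toy model (a sketch with dictionaries standing in for memory and the two private caches; write-back only, and no coherence protocol):

```python
# Toy replay of the stale-read example: P1 and P3 each cache U = 5; P3 then
# writes U = 7 with a write-back policy (cache only), so P1's next read hits
# its own stale copy.

memory = {"U": 5}
cache = {1: {}, 3: {}}    # private caches of processors P1 and P3

def read(p, var):
    if var not in cache[p]:           # miss: fill from memory
        cache[p][var] = memory[var]
    return cache[p][var]              # hit: return the cached copy

def write_back(p, var, val):
    cache[p][var] = val               # write-back: update the cache only

read(1, "U")                          # event 1: P1 reads U = 5
read(3, "U")                          # event 2: P3 reads U = 5
write_back(3, "U", 7)                 # event 3: P3 updates its copy to 7
print(read(1, "U"))                   # P1 hits its stale copy: prints 5
```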
So processors may see different values for U after event 3.
That is what happens in this example we showed with the write-back policy, okay?
This is not acceptable, so we need to fix it.
So, what happens with write-through, then?
Write-through… you recall write-back versus write-through, right?
When you update the cached value, although it's a hit, when you change the U value
from 5 to 7, what do you need to do? "Through" means that you need to go to memory
and change it to 7 there.
If we use write-through, the danger, the wrong answer, is: oh, with write-through,
you don't have a cache coherence problem, do you?
Does it get rid of the cache coherence problem? Still!
Still, P1 has the old value, right? So what are you going to do then?
Flush P1's cache.
Bus is a broadcast, right? I cannot have personal conversation, private
conversation. Whenever I talk to Anna, you can hear everything. So, what it means,
when you update the memory data, you from 5 to 7,
This bus… Allow other processors to snoop in.
Still listen.
Then what do you need to do?
Then what do you need to do?
You can invalidate. Invalidate means you pretend there is no copy; you erase your
copy there.
The other option?
You can update. Okay.
All right, so in our class, we will learn invalidate, okay?
Between an invalidate-based protocol and an update-based protocol…
Can you think about the pros and cons?
So when you see: oh, someone else… okay, I'm sharing this, and then I keep
watching the bus, okay, snooping, and then they change the value from 5 to 7.
Okay, I see that, so I can grab the 7 and update mine.
Or: oh, she's updating, so the data I have is stale, out of date, okay, so I get
rid of it. Okay, what are the pros and cons?
Does invalidate cause more traffic?
Or can it be advantageous, right? Depending on… so the trade-off is about traffic,
right?
traffic, right?
So… if someone is writing, and then: invalidate versus update. As he said, if you
invalidate, the next time you need to bring the block in again if you want to read
or write, right? So if two of us are sharing,
she is updating that value, and I need to read her updated value all the time,
then maybe update is better. Can you see that?
But if the block has been shared, but for a while it'll be used by me, and then
it'll be used by her, because we don't really work together on the same variable,
then invalidation is better, right? So it depends on what type of application you
are running.
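A minimal sketch of the snooping write-invalidate option discussed above, assuming write-through (the loop over the other caches stands in for the bus broadcast; the names are made up for illustration, and this is a toy model, not a real protocol):

```python
# Toy sketch of snooping write-invalidate with write-through: a write updates
# the writer's cache and memory, and every other cache that snoops a write to
# a block it holds drops (invalidates) its copy.

memory = {"U": 5}
caches = [{}, {}, {}]     # one private cache per processor (P1, P2, P3)

def read(p, var):
    if var not in caches[p]:
        caches[p][var] = memory[var]  # miss: fetch the current value
    return caches[p][var]

def write_invalidate(p, var, val):
    caches[p][var] = val
    memory[var] = val                 # write-through updates memory too
    for q, c in enumerate(caches):    # the bus broadcasts the write...
        if q != p and var in c:
            del c[var]                # ...and snoopers invalidate stale copies

read(0, "U"); read(2, "U")            # P1 and P3 both cache U = 5
write_invalidate(2, "U", 7)           # P3 writes 7; P1's copy is invalidated
print(read(0, "U"))                   # P1 misses and refetches: prints 7
```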
Okay, but our goal, first of all, is to make sure we provide the correct behavior,
okay? When someone updates… so.
Do we need to worry about the cache coherence problem for an application that only
reads?
There are parallel reads, a lot of reads happening.
It doesn't matter if you have 1,000, or 1 million, reads at the same time.
Because the data doesn't change, right? That's the kind of application you have
for the GPU: we have a lot of data, right, and it can be shared, image processing,
you know, display, but they only read, so you don't need to worry about the
coherence problem there.
Okay.
Then, the next one…
So how do we fix it? Let's assume we have a write-through cache, okay?
Write-through means you are updating the value in cache and in memory. So
eventually, the other one should see the changed value. Okay, in that scenario,
look at the code a little bit.
What may cause the problem here? Although…
So this is a very famous example. If you take a distributed or operating systems
course, this will appear again and again. And in a lot of interview questions,
they ask this.
So let's say we have a perfect coherence protocol, which means whenever an update,
a write, happens, the written value will be propagated, okay?
So, look at that.
So, you provide coherence, okay? What that means: when you have a write, the
written value will be seen by another processor. I think you see this kind of
example a lot in operating systems courses. They call it a race condition, right?
Race, condition. Race.
So what do you assume? When you print A here, on processor 2's side, you're
spinning, right?
Until flag is set to 1. Then you escape this while loop,
and then you print out A.
But what do you assume here?
A will be?
Zero?
It can be zero, right? So, okay, these are two independent processors, and we
assume sequential execution, right? So you update A equal 1, and then you update
flag equal 1, right? Then, by the time it escapes from this while loop, the spin
lock, we know flag is set to 1, isn't it?
Then, with our human assumption, right, when we code, we assume this will be
sequentially executed, right? So if flag is 1, A should be 1.
Okay? It should be 1, but then there would be no race problem, no race condition,
right? Yet we call it a race condition. Why?
What happened?
You learned caches.
So, connect it with the cache.
We execute stores out of order. So, store… when you execute a store,
in Tomasulo and hardware speculation, remember, what do you do?
You just put it there, even…
Hardware speculation: the commit buffer. That's it, right? The processor can move
on to the next instruction. Can you see that?
the next instruction. Can you see that?
So, what if? What if A equal 1 is a cache miss?
It takes a long time.
Okay?
It sits in the commit buffer.
But then, you go on to the next one.
So what do you do? You have the store instruction there, right?
So you still keep it, right, in the ROB buffer, until it becomes in-order.
However, there is a state where
A has not been updated, but flag, you know, hits in the cache, quick, so you
change it to 1.
Then when you get there, again, this A variable is in far-away memory; it takes a
long time. Because even when you do cache coherence, this update takes a long
time.
So there are time differences, right? The writes are not atomic.
Atomic means that until A equal 1 is done, you won't go to the next one. We don't
do that, right? We keep issuing the next and the next, right? So, in the meantime,
flag can be 1 while A is still 0.
That's a problem.
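One way to see the flag/A race above is to replay the completion order by hand (a toy model: the out-of-order completion is asserted in the made-up `completion_order` list, not derived from real hardware):

```python
# Toy replay of the flag/A race: P1 issues "A = 1" then "flag = 1", but the
# store to A is a cache miss and completes after the store to flag (a hit).

shared = {"A": 0, "flag": 0}
completion_order = [("flag", 1), ("A", 1)]   # flag's store finishes first

observed = None
for var, val in completion_order:
    shared[var] = val
    # P2 spins on flag; the first moment it sees flag == 1, it reads A.
    if shared["flag"] == 1 and observed is None:
        observed = shared["A"]

print(observed)  # 0: P2 escaped the spin loop before A's store completed
```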
So even if we provide a cache coherence protocol, so that when we change A to 1
everybody should see A equal 1, we didn't say how long it takes. Can you see that?
Right?
It doesn't say.
Okay?
So… we name this the memory consistency problem; it's different from the cache
coherence problem. So, then, if your interviewer asks how they're different, you
give an example, and when the definitions come, it will be clearly, almost
mathematically, defined how they are different, okay?
Out of the while loop, your flag is 1, right? The value is 1. There are only two
processors, so the flag value has been updated to 1 here, and we assume in-order
execution, right? And then, when flag is 1, A should be 1. So when you print out
the A value, you expect A to be 1.
However, on a lot of multiprocessor systems, this intuition doesn't work. You may
have A still at 0.
So, we expect the memory to respect the order between accesses to different
locations. Like, when you have A equal 1, then
flag equal 1, you expect that these two happen in order, right? So if you see flag
equal 1, you should see A equal 1.
Actually, no, right? We allow what? Out-of-order completion. So what if the flag
variable is a cache hit, right, and you update it to 1, while A is missed, so you
travel to memory; it takes a much longer time. So when processor 2 executes this
while loop, flag has been changed to 1, so you go to print A, right? But A has not
been updated, because it is still on the way to being updated. Can you see that?
The first write can take much longer than the second write.
Okay? So this is not a coherence problem. We name it the consistency problem,
okay?
So, when we have a multiprocessor with shared memory, we have an intuitive memory
model. When you write in sequence, you expect that the writes happen in order, and
when someone else is writing, you should be able to see the updated value, right?
I showed two different problems where this intuition doesn't work. Okay: first, we
call it the coherence problem; the second is the consistency problem. How to
differentiate these two… it's very confusing, right?
Okay.
Coherence only talks about the values for one location. So if you update one
location, when you read that same location, you should read the updated value.
So it's relatively easy to provide a coherent view. That's in hardware: from now
on, we will talk about the protocol. We will have a protocol to make sure every
processor has a coherent view of the cached data.
Okay? So, values in one location: if a write happens, you will see the written,
updated value. However, consistency is always associated with two or more places.
When you have two different locations, you think the relative order will follow
intuition. When we write code, we assume it's in order, but sometimes, with the
underlying interconnection, and memory far away, accesses take different amounts
of time, so it doesn't work like we expect. That is the memory consistency
problem. Mainly, the memory consistency problem is tackled at the operating system
level. They try to catch these race conditions, and we will talk about it later.
So let me reiterate the coherence problem and the consistency problem of memory in
multiprocessor systems, okay? When multiple processors read and write.
First, coherence means all reads by any processor must return the most recently
updated value for that location.
Also, writes to the same location by any two different processors are seen in the
same order by everyone, okay? So, like, one processor sees the A, B order, and the
other processor sees the B, A order? No. Between writes to the same location, you
need to have a total ordering.
Whereas the consistency problem is about when a written value will be returned by
a read, okay? When you have a written value, it will eventually be read. However,
if a processor writes location A followed by location B, then any processor that
sees the new value of B must also see the new value of A. So you are…
So, consistency always talks about two different locations, two variables, whereas
coherence is based on only one location: if you update one, then a read of the
same location should be able to see the updated value, okay?
Consistency is about relative ordering, okay?
It's a very popular problem; wherever you go as a system designer, for operating
systems, computer architecture, compiler-related work, this is a very important
concept, okay?
I will be back on this after Thanksgiving. Happy Thanksgiving!
I think that it's a… not to have today. That was, like, the reason they came up.
Like, if you want to program something to S5 and write that, it's not gonna work.
It has to commit after the cycle. So it will write into the careful… I mean… Wait a
second.
Dec 1:
Yeah, that, that won't give you the,
But you won't find the bug that way.
Got the… That's why you're so stressed, because, yeah.
Yeah.
We have a lot of design.
Yeah, okay, otherwise… Oh, wow, or Ben. That's awesome.
And it's not even old, it's a completely new apartment. Like, as moved in.
So you're looking for the leased apartments.
What was your rate?
You should try.
international students. My family, they're not a huge fan of Turkey, but I'm… I
like… It's the…
It's a holiday, he's… seasonal food
Okay, so let's talk about the rest of the semester schedule. When is our final
exam?
The 15th, okay. And we will have at least 3 more classes.
And sometime… when is your reading day?
It's the 9th, right?
Reading day, the day before finals begin; we are not supposed to have a class on
the 9th, right?
I can have a special office hour on the 12th, from 5 to 6.
When you don't have any finals going on.
That's my plan. It will be online.
And,
If you want, I can reserve a conference room that we can have a hybrid office
hours, but then the people in the room cannot hear the questions well. So let's do
all online, okay?
But what I said… 19th?
December 19th, 5 p.m.
For one hour only, I would have office hours.
And don't come without any questions.
A lot of you come just to listen to
other students' questions and answers. But at least you need to prepare one question, okay?
I think that will be more effective
than having office hours on the 9th.
Because I don't think you will start to study 614 that
night, because you will be busy with other finals, right?
And today and Wednesday, I will see if I'm happy with the progress; in that
case, I won't have a class on Friday. I will let you know on Wednesday.
Well, thank you.
13.
No, it's now the 15th. So where did you get the 16th? The office hour is on the 19th.
The 9th... no, no, no, no, the 12th.
Well... oh, okay, so the final...
No, you didn't collect the midterm, right? So, usually I pick one or two questions from
the midterm, but I won't, okay? So your final is exclusively on the materials I covered right
after the midterm, right?
But the questions can be related when I talk about cache misses and, you know...
Are you gonna do a final review of new materials? That's what I'm planning to do.
And then I will have a special office hour on the 12th, here. I will put it in the
announcement: December 12th, from 9 to 6 p.m., or 9.
It's online; don't come to my office. We'll have the office hour through... 10.
A lot of times I've been asked:
right after the final, in the last office hour, I finalize the final questions.
So sometimes, if I like your question, I put that question in.
Or one that a lot of students ask, a common one.
And that's a good question, right?
Yes.
So today, we're gonna do quiz 31. Definitely, it'll be on the final,
okay? But
the setup and the numbers can be different, okay? So be careful.
Again: you knew the branch predictor would come, you knew Tomasulo, how
speculation comes, but you missed those questions, right? Why?
Why?
Think about it, if you didn't ace your midterm.
It's not about predicting which questions; I let you know. You need to know, like,
SIMD, convoys, chimes, and speedup, and lanes. There are
a few important things, and definitely I will ask you
about those, right?
So, try to understand not only how.
Okay, first: the questions we went over together, redo them
a couple of times, and then change assumptions; between you guys, you can change
assumptions, do it together, and compare the
answers. If any one small thing is wrong, you really need to dig into why you got the wrong
numbers, right? There should be a reason; you have a misconception.
Those will be really huge on your final, okay?
Right?
Okay, so it's a very fun subject, and one of the most important things anywhere you go.
In CPU design nowadays, all CPUs are
multiprocessors, right? So cache coherence will be...
Okay, and this is really a good interview question,
so that I can find someone who has, you know,
some understanding, but very surface-level.
You can differentiate who knows it really well or not, okay?
So, from today's example:
always put yourself and someone else in, and try to read the data together
and write together. You really need to simulate it, okay? So that you understand
why I need to send the invalidations, you know, why I need to
follow that protocol, okay?
Unless you put yourself in it... I...
Because, honestly, I had a struggle, okay? I couldn't get a 100% understanding
of this protocol.
With this set of slides, we first talk about
cache coherence.
Do you remember?
Oh, we are almost there. Okay, so let me briefly go over
the earlier material, so that we can make the connections.
Because Thanksgiving is a long break, right? So, when we have a cache: this is all
about shared memory, okay?
So, the cache coherence protocol. If you recall, there are two ways of building
multiprocessor systems. One, SMP, the shared memory system; we call it
symmetric memory or shared memory.
The other? DSM. What is DSM?
Distributed shared memory, okay? So you have a notion of private and remote
memory.
Architecture-wise, there are actually two basically different designs, but then,
in the programming model, there are two ways of using this shared
memory. One, through
shared memory, okay? You communicate with another thread, another processor,
through shared variables, okay?
This is all about that, okay? Shared variables. When we have a shared memory, we
need to communicate through shared data.
And what do you need to do?
The other protocol, the library used in DSM, okay, the Distributed Memory System,
is message passing. Okay, so when I want to communicate with a process
on another processor, you do send and receive, okay?
The synchronization is done with the message passing.
So, there are two different libraries, and it is orthogonal to the underlying
architecture.
But, you know, most likely, message passing is implemented on DSM, whereas
shared memory,
the data shared through shared variables, is on SMP,
okay? The shared memory processor.
So, here is the big picture: we have a shared memory, one big memory, and then there
are multiple processors, and they have their own caches.
Yeah.
And then we allow multiple copies. That's the main headache, okay?
How we maintain coherence between multiple copies: that's the coherence protocol.
Okay? That's what we talk about here.
And then you remember this example, right? There are two variables here, and while
P1 updates, P3 cannot see it; then you need to have a protocol. That's the cache
coherence protocol we're going to talk about today, okay?
Remember these things: in this example, there is only one variable, one
location. If you update
that one location, one variable, one data item,
then on the next read, no matter from where, whichever processor reads, you should be
able to see that update, okay? That's cache coherence.
But not when we do this, okay: here we have two variables. So...
For example, when you have an interview, this kind of thing comes up. If your
example has two variables, it's no longer a cache coherence problem, even if you
provide cache coherence. Let's say you change A, okay,
A's value to 1; when you print A, you think, oh, there's been enough time, this should be
1. If you have only one variable, that's the coherence protocol. But
look at this. It's not about A being 1 or 0. It is the timing, the relative order
with another variable.
When we see the flag, we do spinning, a spin-wait. A lot of times,
synchronization in distributed systems is done this way, okay?
We do this, and then we think, as programmers, oh, this should work, right? While
flag equals 0, we are waiting. When flag equals 1, oh, so this is in order, right?
Even though we have out-of-order execution, what do we have? The reorder buffer, right?
We know all of this commits in order, but that is in P1. We never talked about
what happens on the P2 side, okay?
The flag is local; maybe it arrives sooner, but then A takes much longer. It is a
long-distance value, so you won't see the updated value. There are two
variables, okay? That is a race condition that should be covered by the
consistency model, okay? If we had executed loads in order with stores in our out-
of-order processor, would this consistency still be an issue if every load and
store was not reordered with respect to every other load and store?
Think about it for a second. Let's not talk about loads, okay? Because for a load, we
need the data on our side so that we can move to the next instruction, right? But
a store: you issue the store, and that's it, right? This processor won't wait
until A has been updated, right? We sent it off, because the store is not on the critical
path, whereas this,
P3, P2, the other one,
you're reading, right?
You're reading. But when you're reading, you don't care whether this is the new
value or the old value, you know what I mean? As long as you read something, you go to the next
instruction.
That's the load; you just consume it.
So, even when you learn coherence protocols: we want to make sure that when this A-equals-1
update happens...
When I update, you should know, okay?
I update A, and then I update the flag to one; you should know, and I will know. But we
never say how long it's gonna take.
Okay, A equals 1 is updated nearby, the flag arrives at the other end, and she reads it
before A gets there, right?
So there is relativity, because we have distributed memory. Memory isn't in one
place; we don't have any control over atomic writes.
Okay, so I don't cover the consistency problem, okay? If you are taking distributed
operating systems or a graduate-level operating systems course, this should be a big
chapter in operating systems.
It is an operating systems problem.
So, to give you some idea; I don't know if this slide has the detail. So...
We can provide a very strong memory consistency model,
okay? We have a model.
What we do: for every write, we give a total order.
That means, when you have A equals 1,
you don't go to the next line until A equals 1 is done. You need to have
atomic operations. There is just no way that, while A is updating,
flag equals 1 can be updated first.
Okay. If your hardware provides exclusive access for each store, then...
What's the consequence you will have?
A performance penalty.
So, in hardware, we don't want to do it. So, if I quickly share the trend for the
memory consistency problem in hardware: we do have speculative
operations, right? We don't keep a strong order; we do break the order, okay? Then
what if you really need "print A" to give 1 once the flag flips? Then
you won't write it this way. What do you use? You learned the
monitor, okay, the spin lock. There is synchronization and hardware support,
okay? You should use the barrier. Those are explicit tools for synchronization.
You cannot rely on implicit ordering when you need synchronization. The ordering you
described, where you wait for A before moving on to the flag:
is that only an ordering of loads and stores? Like, if we have an addition
between these, can that go out of order with the load and the store, if it's only
registers?
Even with hardware speculation,
we don't wait until the flag write is completely done everywhere and then go on together. We
never do that.
We would then have to notify every cache,
and wait until it has reached the ordering point,
as it goes through the network.
Because, indeed,
put that aside: locally it is in order. You have correct values in your domain, register
values, everything. But globally we cannot promise... We can, okay? We can.
We could provide the atomic write, which means: until flag equals 1 everywhere...
What happens: I change flag to 1, and then make sure
everywhere else flag is 1. If it is updated, you need to have flag equal 1 in memory
too. You need to wait until everything's done, and you have
acknowledgments from everywhere.
Then you go to the next line.
That's too slow, right? So, in hardware...
So what do we do at the hardware level?
We break that consistency principle. So you will learn in operating systems,
there is the strongest one, sequential consistency, and the weak models; I forget the
three different consistency models we learn.
And, so...
We relax the conditions, we break the assumption that writes happen atomically, and
we will have only partial ordering, not total ordering.
So, you can imagine there are instructions here and instructions there, for
two processors.
The user thinks each
instruction stream is in order, right? But we never know the order between these two.
That's the consistency model. So you can imagine two stacks of cards.
You don't change the order within each stack, but you can merge them.
Okay, you merge, and there is one order, right? This order, seen by both processor
1 and 2, they agree on this order; then that's sequential consistency, okay? But that
really degrades the performance.
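The "two stacks of cards" picture can be written down in code. This is an illustrative Python sketch (my example, not from the lecture): it enumerates every global order that keeps each processor's own program order intact, which is exactly the set of executions a sequentially consistent model allows.

```python
from itertools import combinations

def interleavings(p1, p2):
    """All merges of two instruction streams that preserve each stream's
    internal order, like merging two stacks of cards without reordering
    the cards within either stack."""
    n, m = len(p1), len(p2)
    results = []
    for positions in combinations(range(n + m), n):  # slots taken by p1's ops
        taken = set(positions)
        merged, i, j = [], 0, 0
        for k in range(n + m):
            if k in taken:
                merged.append(p1[i]); i += 1
            else:
                merged.append(p2[j]); j += 1
        results.append(merged)
    return results

p1 = ["W(A)=1", "W(flag)=1"]   # processor 1's program order
p2 = ["R(flag)", "R(A)"]       # processor 2's program order
orders = interleavings(p1, p2)
print(len(orders))             # C(4,2) = 6 legal global orders
```

With two operations per processor there are only 6 legal merges; a weaker model admits orders outside this set, which is why the flag idiom can break without explicit synchronization.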
In hardware, we don't want to do that unless there is a real need for
synchronization. Then what we say is: if you need synchronization, you do
explicit synchronization. Put a barrier, use a signal, a
monitor. And otherwise, we aggressively execute out of
order.
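As a concrete example of explicit synchronization, here is a minimal Python sketch using the standard library's `threading.Barrier` (the variable names are mine, for illustration): neither thread passes the barrier until both have arrived, so the reader is guaranteed to see the writer's value without relying on implicit write ordering.

```python
import threading

# Explicit synchronization instead of implicit write ordering: both
# threads call barrier.wait(), and neither proceeds until the other
# has arrived, so the read below must see the write above the barrier.
barrier = threading.Barrier(2)
data = 0
seen = []

def writer():
    global data
    data = 42            # write before reaching the barrier
    barrier.wait()       # arrive; releases the reader

def reader():
    barrier.wait()       # block until the writer has arrived
    seen.append(data)    # guaranteed to observe data == 42

t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=reader)
t2.start(); t1.start()
t1.join(); t2.join()
print(seen)
```

This is the software-level tool the lecture refers to: the hardware stays aggressive, and the program states exactly where ordering matters.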
All right.
So, hopefully you can differentiate memory coherence and consistency now.
So we talked about the difference between coherence and consistency.
And then, I think I need to start from here: basic schemes for enforcing coherence. Okay.
Whose memory is really good?
Who knows?
Any processor that sees the new value of B must also see the new value of A. So you
assume that in-order writing happens, okay?
But in a lot of real systems, it's hard to provide this consistency. Yes, if we
don't worry about performance, we can provide a strong consistency model, but it
hurts performance, so we break it, as long as it still works, okay? Because a lot of the
time, at the software level, if they really need synchronization and
consistency, they put in a barrier and use a software library, okay? So in
hardware, we can
provide a somewhat relaxed model. That's the whole point of a consistency
model: we try to break the order as long as
it can be seen as correct behavior of the program.
So from now on, let's focus only on the coherence problem first, okay?
So...
The thing is, programs on multiple processors will normally have multiple copies of
the same data in their own caches, okay? That's the root of the problem.
Now, instead of
not allowing multiple shared copies, we provide multiple copies to speed things up.
However, we will come up with a hardware protocol, okay: whenever you write,
whenever you read, you should follow some protocol to maintain a coherent view of
the caches.
So, we provide migration and replication,
okay, to improve the performance of shared data. If you don't allow replication,
multiple copies, we don't need to worry about a coherent view, right? Because there
is only one copy, writing happens in only one place, so everybody will see that
written value, right? But with multiple copies, we allow replication, so we need
to make sure the protocol is followed so that others
later re-read the updated value.
So, do you see migration versus replication? Let's say there are two
processors,
and both of them need a
common variable, okay. Migration means whenever I read, I put it in only my
cache, exclusively. Whatever I need is migrated to my cache.
So, since you have only one copy, there is no coherence problem. Easy to handle.
Replication!
I allow it: Anna also has a copy.
Now, when will this be good?
We both have copies.
So, yes, we have a shared benefit, because we work together, right?
So, can you think more about operations? What kind of operations do you have?
Read? Or write?
That's right, right? So, replication will be beneficial when?
When you have reads most of the time.
It doesn't hurt having multiple copies, does it?
How about both of us keep writing, and we are supposed to
have the updated value
all the time? Then what happens?
With writing,
replication is really difficult, right? You have a copy, I have one; I change it from 1
to 2. You should see 2, right?
Then I will let you know: because we have this broadcast medium, I say, Anna, I
changed it to 2, you also need to change to 2. Please update.
That's an update-based protocol. Or, I'm going to write, so invalidate your copy, okay?
Okay, so I can change it from 2, 3, blah, blah, to 10 at the end, and when Anna wants to
read, then she can read the latest, instead of... she doesn't need to see all the
changes. Can you see that?
So, update versus... the other version was invalidate, we discussed, right?
With invalidate, you invalidate only once, and then updates can happen many, many times by
another processor. You don't need to care about that. Only when it's time for you to
read, then you get the updated value. As long as you get the updated value, that is fine.
Can you see that?
So let's go through this protocol one more time. Anna has S equal 1, I have
S equal 1.
Then somehow I write, okay? I change S to equal 2.
Then I will let her know.
The update version works this way: whenever I update, I shout at her.
S equals 2, and she copies it: S equals 2.
3: she copies. Can you see that? This is the update version.
After my update, S equals 3, then up to 10,
and finally, she needs to read that S value. How many times did she keep updating by
herself?
Versus...
the invalidate protocol. She has 1, I have 1; then when I update it, going from 1
to 2, I let her know, I shout at her: invalidate.
I'm going to change it. Invalidate. She invalidates, okay?
Then whenever I change the S value from 2 to 3, 3 to 4, 4 to 5,
she could listen, but she doesn't have to do anything, because she doesn't have a
copy.
It's not your business, right? I don't have a copy, it's not my business.
Then I change S to 10. Then: oh, I need to read S. She looks at her cache.
She doesn't have it, because she invalidated it before, right? Then she will put out what?
A miss request.
Okay? Oh, I need this S value, okay? Then, I do snooping.
We share all the keywords here. Snooping, okay: snooping means you keep listening. She
shouted: okay, I need S.
Then, being a good citizen, I look at my cache. Is there any S? Oh, I have the S
block.
Oh, you need it for a read? Okay, you can read it, I say, okay? That's the protocol.
All right.
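The walkthrough above can be simulated in a few lines. This is a toy Python sketch under my own assumptions (one writer, one reader, ten consecutive writes before the read): it just counts bus messages to show why invalidate wins when one processor writes many times before the other reads.

```python
# Toy model of the lecture's S walkthrough: I write S ten times, then
# Anna reads. Update shouts the new value on every write; invalidate
# shouts once and Anna takes a single miss on her final read.
def simulate(policy, writes=10):
    my_copy, anna_copy = 1, 1      # both caches start with S = 1
    bus_messages = 0
    for value in range(2, 2 + writes):
        my_copy = value
        if policy == "update":
            bus_messages += 1      # broadcast the new value; Anna copies it
            anna_copy = value
        elif anna_copy is not None:
            bus_messages += 1      # broadcast invalidate, exactly once
            anna_copy = None       # Anna drops her copy
    # Anna finally reads S
    if anna_copy is None:
        bus_messages += 1          # her miss request; I snoop and supply my copy
        anna_copy = my_copy
    return anna_copy, bus_messages

print(simulate("update"))      # ten broadcasts, no miss
print(simulate("invalidate"))  # one invalidate plus one miss
```

Both policies end with Anna reading the final value 11; the difference is ten bus messages versus two, which is the lecture's argument for invalidate when writes cluster on one processor.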
So, replication is really good for reads, right? Because a read doesn't change the
value, if you have N replicas, you can provide N simultaneous reads. This is the
main reason for allowing replication. Migration, okay, migration: we
move data to the local cache and use it there in a transparent way. This reduces the
latency to access shared data.
Okay, so if
your data is far away, it takes a longer time, right? So once you have a miss in
the nearby cache, you bring it from far away and you leave it in the nearby cache;
migration is allowed, okay, so that you can achieve higher performance. So those are the two
things
we have; that's why we need to come up with a coherence protocol.
So in the world, there are only two classes of cache coherence protocols. One,
directory-based: there is one centralized directory that
keeps track of the sharing status of every block. You remember cache
blocks, right? Per cache block, you keep track of the status: which processor has it, which
processor has the most up-to-date copy, like that. So whenever anyone needs to read, you go
to the directory first.
Okay? The other is snooping. We will talk about the snooping cache coherence
protocol first, because it's easier, okay? Here,
you have a common broadcast bus, okay; all the caches are interconnected on the bus
so that everybody can see what's going on. So you keep snooping, right? All
cache controllers monitor, or what we call snoop, what's going on. And if a
write happens,
I will update my own copy's state. Do you know what I mean? So here, in the classroom, it's all
broadcast, right? So even if I want to talk to someone privately, everybody can
listen, right? And then, you know: oh, EJ said there will be no
quiz tomorrow; then you know, right? So that's snooping.
So we will talk about the snooping-based cache coherence protocol first.
We're here.
Snooping!
It's when we have a common communication medium, like a bus or a single switch, and
everybody can see, all right?
But if you don't have one: like a mesh, or, in Intel, they have a torus; they
have different interconnections, point-to-point. Then the
communication between two nodes, you never see, right? So
the changes we make on caches, you never see. Then what do we do? We have a directory, okay?
There is a central position that keeps track of every cache block:
who has the most up-to-date copy, or who the readers are.
Can you see that? So every time you need some block, you come to the directory, the
centralized directory, and see if there is a copy, okay? And say the updated block is
owned by Noel; then you go to Noel and get that block, okay?
Can you see that? The directory is a centralized one.
Let's move to the next slide.
And then, when you hear this,
maybe you want to have an empty paper and draw the protocol along. Protocol means,
you know, a finite state machine.
Do you remember the branch predictor?
The table, the branch history table.
Always, when we go over a finite state machine, if you draw it, and for each
transition you understand why the transition happens, it's easy,
right? So let's do that, okay? We will start with the write-through protocol
first. So you need to have an empty table.
Sometimes in undergraduate-level courses, I ask students to draw it and collect them, and
give
extra credit, okay, because I really think it is important. You think it's a
very silly activity, right? I let you draw it by yourself? That is important, and I
don't believe you will do it by yourself. That's why I force you to do it here.
So, the students mention it in the course reviews.
They complain: you know what? EJ asks you to do things, and then she always
checks whether you... like, I walk around and check that you really do it.
This set of slides, we're gonna study
the Snoopy cache coherence protocol. This is a very important topic. Most CMPs,
where you have a shared cache, have a snoopy-based cache coherence protocol.
If you have colored pens, use different colors. Snoopy protocols
rely on a shared medium, either a bus or a switch, where you can see all the
communication happening between processors and memory.
So, you remember, for a cache block, you have data and an address tag, right? The state bits
are important. Here, we learned the valid bit,
which indicates whether there is data or not, right? And then when you have a
write-back policy, you have another bit, the dirty bit: updated or not, right? So,
this is all about the state bits, what kind of state bits we're gonna add with the
cache coherence protocol. So, here is how it works: every cache controller snoops
all transactions on the bus.
For any relevant transaction, you see there is a miss in another
processor for a block you have; then you take action, okay? You take action to
ensure coherence: either invalidate your own copy, or update it, or supply the
value you have.
So, when we go over the snoopy cache coherence protocol, always put yourself in the role of one
of the processors and ask what's going on. That's the protocol.
So, either you get exclusive access before a write, or you update on every write,
okay? So, there are two policies we can take: either invalidate or update all
copies.
So I just explained, there are two policies we can take, either invalidation or
update. Remember, coherence is always
associated with writes, right? So you need to recall there are two write policies in
terms of the cache: one, write-through, the other, write-back, okay? So you will have
four different combinations. You can have either write-through or write-back as the
write policy, and then as the snoop protocol policy, you can have invalidate or
update.
So here, this example is write-through with invalidate.
Okay? Keep that in mind. Let me show you the example.
So here, let's say you are reading U = 5 from processor 1 first, and then processor
3.
Okay?
And then, processor 3 tries to update.
Here, you have write-through. What happens when you write the data to the
cache?
Write-through means what?
Basically, this blue arrow goes to the P1 node, right? Write-through means this U-value
update happens to the memory too, right? So this transaction can actually
be observed, snooped, by processor 1. That's the whole point.
Can you see that?
Because it's write-through, you need to update memory too, and that
will go through the bus, and the bus will be monitored by all the processors.
So, the other two processors keep listening, keep monitoring, and then U = 5, and
then their cache index, right?
The cache block number will be on the bus. Okay: I'm updating this block with the updated
value, U = 5. Let's say I'm P1. Then what I do: I do tag
matching. I get the block address and check my cache
tags to see if I have a hit. A hit means I have a copy, right?
Here, I have the block, U = 5, but it is about to be updated.
Then what do we do? This is invalidate, right? So, I have a copy of that
block, a hit. Do you remember?
In the cache block, the first piece of metadata was whether it's valid or
invalid. Can you see that?
A hit means it's V, right?
Valid equals 1. Now, you see someone else is writing, through the bus.
Then what do you need to do? Just change the V to I, isn't it?
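That snoop check, take the address off the bus, tag-match against your own cache, and flip V to I on a hit, can be sketched like this. The block and set sizes are assumptions for illustration, and this models a direct-mapped cache only.

```python
# A hypothetical direct-mapped snoop check: split the bus address into
# tag / index / offset, then compare against our own tag array. A match
# on a valid line means we hold a copy and must invalidate it.
OFFSET_BITS, INDEX_BITS = 6, 8        # 64 B blocks, 256 sets (assumed sizes)

def split(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# our cache: per set, a (valid, tag) pair
cache = {idx: (False, 0) for idx in range(1 << INDEX_BITS)}
cache[5] = (True, 0x1234)             # we hold the block with tag 0x1234 in set 5

def snoop_write(addr):
    """Another processor wrote `addr`; invalidate our copy on a tag hit."""
    tag, index, _ = split(addr)
    valid, our_tag = cache[index]
    if valid and our_tag == tag:
        cache[index] = (False, our_tag)   # V -> I

bus_addr = (0x1234 << (OFFSET_BITS + INDEX_BITS)) | (5 << OFFSET_BITS)
snoop_write(bus_addr)
print(cache[5])   # our copy is now invalid
```

This is the same block-offset/index/tag partition from the earlier lectures, reused by the cache controller on every bus transaction rather than only on the processor's own accesses.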
Okay, we talked about one action.
How many different actions will we have in total?
The notion here is: my action, or I see something going on on the bus from other
processors. Can you see that?
There are only two things: by me, or by others.
In terms of?
You know, cache operations.
Is it correct or not?
So then, how many different kinds of operations do you have on one cache block?
There are only two: read or write.
Okay?
So, let's go.
I have a read request from my side.
Okay, and then, of course, I will check my cache, right?
If it is a hit, what happens?
I'm trying to read. It's a hit.
Read.
That's it.
Does it change any status?
Nope. Okay?
Then... so we are done with one thing among four different combinations. And the second
is my write, right? I'm writing.
So let's talk about the status bits, okay?
I think I will explain with the figure; that will be better. Then you can draw it
together.
So in terms of status per block, there are only two statuses: whether it's V
or I.
Okay? And then requests: there are only 4 requests.
From me: read or write. From others: read or write.
And then for each state, you go through all 4 cases.
So you handle 8 different
situations, that's all. You can complete the coherence protocol.
Do you want to try?
Or do you want to do it together?
Yeah, that's fine, that's good.
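Here is one way to write the table down before checking it against the slides: a Python sketch of the write-through invalidate machine, with two states (V, I) and four requests (PrRd/PrWr from me, BusRd/BusWr snooped from others), eight entries total. I have assumed no-write-allocate on a write miss, the usual pairing with write-through; the lecture may define that case differently.

```python
# Sketch of a write-through invalidate state machine: per cache block,
# two states (V, I) and four requests give the 8 situations from class.
TRANSITIONS = {
    ("V", "PrRd"):  "V",   # my read hit: nothing changes
    ("V", "PrWr"):  "V",   # my write: stay valid, write through to memory
    ("V", "BusRd"): "V",   # someone else reads: memory supplies it, keep my copy
    ("V", "BusWr"): "I",   # someone else writes: invalidate my copy
    ("I", "PrRd"):  "V",   # my read miss: fetch from memory, become valid
    ("I", "PrWr"):  "I",   # no-write-allocate assumed: write goes to memory only
    ("I", "BusRd"): "I",   # not my business, I have no copy
    ("I", "BusWr"): "I",   # not my business, I have no copy
}

def next_state(state, request):
    return TRANSITIONS[(state, request)]

# Walk the lecture's scenario: I hold the block, another processor
# writes it (I invalidate), then my own read misses back to memory.
s = "V"
for req in ["BusWr", "PrRd"]:
    s = next_state(s, req)
print(s)   # back to "V" after the refetch
```

Filling in this table yourself, then simulating a few request sequences against it, is exactly the drawing exercise described above.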
Step 3: what you need to do, if it is an invalidate-based protocol, is invalidate all
other copies. So, you are P3. And then, if there are other
processors with the same copy, you should let them know: invalidate your value, right?
Your block. So, before you write, you invalidate those copies.
So, write-update uses more of the broadcast medium's bandwidth. Write-invalidate removes the
copies in other processors; when other processors try to read it later, they will
have a miss,
which means they need to go to memory. With write-through, you already have the updated
value in memory, so any processor that later accesses the same block will read
the updated copy.
To provide this kind of cache coherence protocol, we need to have architectural
building blocks.
So, first of all, we need to build a finite state machine, okay? The finite state
machine will deal with how each cache block's state changes upon which
action. So, either it's invalid, valid, or dirty; you remember those from the cache
alone, but we will add more
states if necessary.
Again, this snooping-based cache coherence
relies heavily on the broadcast medium, meaning this bus provides a serial
ordering point. Everybody sees the same thing at the same time, okay? No processor
will be able to see a different order of writes.
This is a very fundamental system abstraction. Without a common medium, we cannot have
a snoopy cache coherence protocol. And actually, relatively modest chip
multiprocessors interconnect multiple chips through a bus, so it works.
So you can take this as a logical single set of wires connecting several devices,
only a single set. So the protocol will have arbitration when you try to use the bus, and
in this protocol, for the bus, you will have a command and address,
you need to tell which processor, and then you will have the data.
Every device observes every transaction; that's the most important thing. So it
doesn't have to be only a bus; it can be a switch, as long as every device attached
to this medium can listen to
everything going on.
The broadcast medium enforces serialization of read and write accesses, especially
write serialization. So the first processor to get the medium invalidates others'
copies, which implies
it cannot complete its write until it obtains the bus, okay? So you need to get the bus
first, so that you can have an order.
All coherence schemes require serializing accesses to the same block. So for the
same block, you need to have a total ordering, by having one access point. Here, the one
access point is the bus.
It doesn't say you will have serialized access to every data item, okay? Per
cache block, you need to have one ordering point, okay?
So we don't care about ordering across different blocks; we don't provide global ordering, okay?
That's why we don't provide consistency.
This only provides coherence.
We also need to locate the most
up-to-date copy of a cache block. So, per cache block, you may need to
record where the data is, to generalize this protocol more.
Let's talk about how to locate an up-to-date copy of the data.
We need to think of two scenarios: one with the write-through policy, the other with write-
back. With write-through, you know up-to-date copies are always available in memory,
right? So you just go to memory when you read.
However, write-back is harder, because with write-back, the most updated copy can be
in one of the caches.
So…
Let's think about a cache miss. There is one processor looking for a cache block,
and it misses. Then you will put the miss request on the
bus, right? Then what can we do? We can use the same snooping mechanism. This miss
request on the bus can be heard by any of the other processors. So, if
you look at the address and then check it against your cache: oh, I have an up-to-
date copy. Then what you can do is supply the up-to-date copy onto the bus.
So, let's go through it one by one. With write-through, you don't need to do
this, right? Even though there is a miss request, the miss request
will be forwarded to the memory, and nothing else happens. However, if you are trying
to write, then you should know there is another read or write going on,
right? So anyway, you need to snoop on what's going on. With write-back, say you
have the most recent copy, and you listen, and then: oh, the
data block they are looking for
is what I have. Then you need to supply that up-to-date copy, because memory
doesn't have it.
However, it's not that simple, because the processor is not the one snooping. Remember,
the snooping action happens in the cache controller, right? The CPU will be doing other things. So,
even when the cache controller has located an up-to-date copy in my cache, what it needs to do
is get that block from the cache. Retrieving a cache block from a
processor's cache can be complicated, and it can take a longer time, okay?
So, we will go for either write-through or write-back. However, as we already discussed
in the cache architecture, write-back means much lower
memory bandwidth compared to write-through, right? So it can support much larger
numbers of processors. So, most multiprocessors use write-back, okay? So we
need...
So, our goal by the end of this week: we want to learn the write-back-based snoopy
protocol.
Okay: write-back, invalidate. But we will start with write-through
invalidate, because write-through is simple. Why? Your memory always has the solution,
right? The up-to-date copy. The only thing is, when you miss, you get it from
memory; that's it.
Okay? As the other processor, even if you have a hit, you snooped, and then you
compared: oh, I have the copy. You don't need to do anything, because memory has the
copy.
Right? So the protocol is very simple.
We want to learn the snoopy protocol with write-back policy, but we will start with a simpler one. Before looking at the snoopy-based cache coherence protocol for write-through, which is the simplest, let's discuss how to implement a snooping cache coherence protocol.
Okay, so recall the branch predictor. When you draw it in the exam, it's just a table form, right? You index it with certain bits of the address, and you have the table entries. But each entry actually follows a finite state machine, right? So you can think, for each address, there is a separate finite state controller keeping track of a state. It's the same thing here. A snoopy cache coherence protocol is usually implemented by incorporating a finite state controller in the cache controller of each processor node. Logically, we can think of a separate controller associated with each cache block, okay? For each cache block, you will have a few bits indicating a certain state, right? And that state will move around based on the actions, okay? We will learn that.
So, therefore, snooping operations or cache accesses for different blocks can proceed independently, okay?
You will see that in the example.
When we go through this example, just keep in mind that we assume each operation runs to completion — we don't interleave — but in an actual implementation, a single controller allows multiple operations to distinct blocks to proceed interleaved, okay?
So, for now, we just keep it simplified, and let's go through the example.
Yeah, so, okay, here is what it means: you hold the bus the whole time, versus splitting it, okay? So when we, let's say, put the miss, or the invalidate, whatever, we use the bus. Until that action is done, we hold the bus, okay?
Even if you have gone through the bus and those actions are happening in the memory, you still hold the bus. Why? We want to provide a serialization point through the bus.
But that will kill the performance, right?
Because your bus is not being used right now, but you're still holding it — everyone else is blocked, right? So we do split transactions, okay? Once you have used the bus, you release it, and another processor can use it.
But then you break that atomicity.
For this class, let's do the simple thing, atomic operations first, okay? You hold the bus until you are done, okay? In this protocol, one arrow, one action is completed while you hold the bus. There will be no interleaving between different actions, okay?
Why am I talking about this? I'm kind of afraid that in your interview, you never think about a split-transaction bus, and then, you know, you'd say this class never discussed it.
In actual systems, we release the bus, okay?
You said the consistency is done by the software or operating-system side. So in this case, in real systems, when the bus is split, how exactly do we program? Because I think semaphores are… like, you don't do multi-core programming, right?
So, I don't want to go down that other path… okay, do you remember when we have a while loop, waiting while the flag is zero, and then print A, right? What that means: you want to have this order, right? Then you need to put a semaphore, or monitor, or barrier to synchronize this order.
Okay.
When you have this, you put the barrier, and the barrier means that until this write is done, you don't do anything.
Then you go to the next one. So you provide in-order writing. Then the reads, whenever you read, will be in-order reads, okay? It forces the bus to be…
Yes, yes.
Okay.
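The while-flag pattern being described can be sketched with Python threads; `threading.Event` stands in for the hand-rolled flag spin, and the variable names are illustrative, not from the lecture:

```python
# A sketch of "while flag == 0: wait; print A" using a synchronization
# primitive: the Event forces the write of A to become visible before
# the reader proceeds, giving the in-order behavior described above.
import threading

A = 0
flag = threading.Event()

def producer():
    global A
    A = 1          # write A first ...
    flag.set()     # ... then raise the flag (the ordering point)

def consumer(out):
    flag.wait()    # blocks until the flag is set
    out.append(A)  # guaranteed to observe A = 1

result = []
t_cons = threading.Thread(target=consumer, args=(result,))
t_prod = threading.Thread(target=producer)
t_cons.start()
t_prod.start()
t_prod.join()
t_cons.join()
print(result)  # [1]
```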
Because we have a strong assumption that the bus is the only medium, and you don't do split-bus operation: until you grab the bus, you will be waiting outside, and once you get the bus, you own that bus until you finish. So the bus will be the global serialization point.
Now, let's look at the write-through invalidate protocol, which is the simplest cache coherence protocol.
Okay, so this is what I want you to know, okay? For each cache block, we have a state. The state flag is only one bit: V or I, okay?
So you draw two circles, one V, one I, okay?
And then there are 4 different actions that can happen.
Your own processor can read (processor read); your own processor can write (processor write).
And there are two others.
You will see other processors' actions through the bus: you see a bus read, because another processor is fetching,
and you see a bus write on the bus.
Okay.
Alright? So then, from here — valid — you go through each action.
Okay. So let's start: which one? You have a processor read. What happens?
You go to the cache…
The tag matches, and the valid bit is 1. You have the data. What happens?
It's your own read, right? Do you have any state change? No. So it goes back to V. Do you understand this?
You go back, okay?
Alright?
Clear?
And how about your own write?
You look at your cache, it's a hit, but now you're writing.
Remember, this is write-through. Write-through means you need to write through to memory, right? So in terms of state, it stays V. There is no transition.
However, there is a consequent action you need to take. You need to put a bus write, right?
You need to put the write request on the bus, so that it reaches the memory and you update it.
Done. Okay.
Alright.
So, let's do the two other things. While in V, you snoop, you see a bus read. Someone is trying to read. What do you need to do?
You don't need to do anything.
Why?
Because it's write-through, right?
Even though I have a copy, I don't need to do anything. Memory will eventually supply it over the bus, right?
How about a bus write?
We're holding a V copy.
Remember, this is an invalidate-based protocol.
You have a V.
You hear a bus write: someone else is writing, and I have my copy.
Say Ana is writing, right?
Either hit or miss, she will put that write request on the bus, right? Because it's write-through, it goes to memory. I snoop — what do I need to do?
Change V
to I. Invalidate. Can you see this arrow?
What else do I need to do? Nothing, right? Okay?
So we took care of all four actions from V. Done?
Okay.
Try it
yourself.
Try to finish this from the I state.
Go through them one by one, okay?
If you don't understand this, you won't be able to understand the other protocol, the write-back one, right?
And then, if your answer is the same as this, answer this question:
from this protocol, what do you
speculate — is it an allocate or a no-allocate policy they have?
You remember the discussion we had? When you have write-through, there are two policies you can have on a write miss:
either you can allocate, or you don't allocate.
Okay.
Now look at the protocol. Which one is it, and why?
The write is going to the memory — can you justify your answer with this protocol? Where do you find the evidence?
If it is allocate, what happens?
The write miss would bring the block in, right?
So, okay, I'm writing, right? I'm writing.
But when I check, it's invalid, which means I don't have a copy. It's the first-time write, right?
So, from I, what do I need to put? A bus write, right?
Look at this: the arrow coming back to I means I don't bring the block in.
Can you see that? I check my cache, I don't have it, then I put the write request on the bus, so that it can reach memory, but
I don't change the state to V, which means I don't bring it in — I don't allocate, okay?
That's one piece of evidence.
How about from I when you try to read?
A read miss: you need to bring the block in, right? So it will change to V, but what do you need to do? You need to put a bus read, right?
So we've taken care of these two.
How about the others?
Bus read and bus write.
You listen, you snoop. You check the address against your tags.
And you see it's I.
Which means I don't have it. Do I need to do anything? No.
How about when you see a bus write? I check mine, I don't have it, I don't need to do anything. That's it. We've taken care of all 8 different actions, okay? Draw it…
And then we can move on.
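The eight cases just walked through can be written down as a little transition table; here is a minimal sketch in Python, with my own labels PrRd/PrWr/BusRd/BusWr for the four actions:

```python
# Write-through invalidate protocol as a finite state machine.
# States V/I and the eight (state, event) cases follow the lecture's
# diagram; the dict layout and event names are my own illustration.

# (state, event) -> (next state, bus action to issue, if any)
TRANSITIONS = {
    ("V", "PrRd"):  ("V", None),      # read hit: no change
    ("V", "PrWr"):  ("V", "BusWr"),   # write-through: update memory too
    ("V", "BusRd"): ("V", None),      # memory supplies data; do nothing
    ("V", "BusWr"): ("I", None),      # another processor wrote: invalidate
    ("I", "PrRd"):  ("V", "BusRd"),   # read miss: fetch block, allocate
    ("I", "PrWr"):  ("I", "BusWr"),   # write no-allocate: stay invalid
    ("I", "BusRd"): ("I", None),      # not my block: ignore
    ("I", "BusWr"): ("I", None),      # not my block: ignore
}

def step(state, event):
    return TRANSITIONS[(state, event)]

# Example: write while in V, then snoop another processor's write.
state = "V"
state, action = step(state, "PrWr")   # stays V, issues BusWr
state, action = step(state, "BusWr")  # someone else wrote -> I
print(state)  # I
```

Note how the two arrows back to I on PrWr encode the no-allocate evidence discussed above: a write miss updates memory but never moves the block to V.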
What is this bus write released from I?
From I?
Yes?
The bus write?
This one? Oh, this… okay. First:
this column says whether the action is from my own processor. We talked about all four different kinds, right? And then
the consequent actions: whether you need to do another thing or not. So when you write, you need to put a bus write.
When you read, you need to put — if it is a miss — a bus read.
These two or three arrows, right, have actions to the bus.
And the other four?
Those you see on the bus — others' actions.
So when I put a bus write — why a bus write? Because I'm writing —
other people look at this action so that they update. Can you see that?
So, when you're doing a processor write, and the state is invalid,
don't we still need to get the copy? No, we don't. What happens? With write-through, you just update memory, you don't bring the block in. That's why it stays
I, okay?
Alright? The initial value? It's just a write. Yeah, because it's a write, you don't bring it in.
Any question? Can I move on?
Did you finish?
Shall I collect your paper for extra credit?
Actually… oh, good. Yeah, good. Undergraduate students.
Right? I ask you to draw it, and then I check whether you are right or not.
And my son laughs at that, and he also does it: he's a team leader, so he asks his teammates to do some work, and then he always checks it.
All right? So, if you can draw this yourself, understanding it, we can move to the next one, yeah.
First, Krista?
So, in the valid state… valid, okay. A bus write comes. A bus write comes? So, what will happen? It's the same as… processor read. Okay. I'm sorry, a bus read comes.
Okay, so let's just think, okay? You are the processor.
And then EJ — you see EJ reading from the block you have, because it's V.
It should be a signal. A bus read?
So you have a copy, right? And then I put a bus read on the bus, because I don't have it. Who has the copy? Do you need to provide it? No — memory does.
Because with write-through, memory always has the up-to-date copy, so even though you have a valid copy, you don't need to do anything. That's the simplification. I know that for performance benefit you may supply it, right? A cache-to-cache transaction on the bus may be much faster than
the memory operation.
Okay, but in terms of the protocol: the protocol's job is to provide correct behavior; we don't worry about performance, okay? Performance-wise, you could supply it, but we don't go for that
advanced optimization, quote-unquote.
What's the purpose of adding a bus read at all, then?
Like, you're just filling up the bus.
So what's the purpose of a bus read?
You could just read from the memory, like a uniprocessor.
Okay.
So, okay, bus read. Think about it from your side: when do you put a bus read? When you actually want data from…
Yes.
When you don't have it in your local cache, isn't it?
There is no other way to get the data. You need to put that request on the bus, isn't it?
Okay, understood. Because you have a cache.
So first you check the cache, and since you don't have it, you send that read request to the memory, right? And then others can hear it. Can you see that?
Any other questions?
Okay, so let's get all these things. If you couldn't draw it, please, please go through the 8 different actions. For each state, you have 4 different actions, okay? That's all you need to do.
So, let's move on.
Let me explain one more time. With this write-through snoopy cache coherence protocol, we can see that writes establish a partial order.
It doesn't constrain any ordering between reads.
Right? Since we are using a shared medium, the bus,
that will order even the misses.
Any order among writes is fine, as long as each is in program order.
So from now on, we will see the snoopy cache coherence protocol with write-back policy.
Note that with write-back policy, the most updated copy can be in one of the other processors.
So, the normal cache tags can…
Do you agree?
Write-back means when I have a cache hit for a write, I only update my copy, isn't it?
Alright? That's the difficult situation we are running into.
be used for snooping. Whenever there is a read miss or write miss on the bus — when you're snooping the bus reads and bus writes — you do
tag matching: oh, is this what I have or not? So the tag will be used. And the valid bit per block also makes invalidation easy. Remember, for now we are looking at write-back policy, and we do invalidate: if there is a write going on elsewhere, then instead of updating the value, I will invalidate the copy I have.
So read misses can be easily handled, since they
rely on snooping — everybody snoops. And writes: actually, there are two cases. If there are no other copies around, then there is no need to place the write on the bus, because with write-back there is nobody else — but you don't know that, right? And if there are other copies, you need to invalidate them. So how can we handle this efficiently?
So unlike write-through snooping, write-back snooping may need more bits for the state of each cache block.
There is a notion of dirty, right — updated or not. And to track whether a cache block is shared, we need to add an extra state bit associated with each cache block, like the valid or dirty bit. So, when you write to a shared block,
you need to put an invalidate so that the other copies will be gone, right? For example, you are trying to write, and there are other copies; since it is invalidate-based, you need to invalidate the other copies.
Then you become the owner of the cache block, which means after you make an update,
any miss request — a read request —
you see on the bus, you are in charge of supplying the data, because you are the one who has the most up-to-date copy.
So let's look at cache behavior in response to the bus with snoopy cache coherence — invalidate-based, for write-back policy, okay?
Every bus transaction must check the cache address tags. So
every cache controller needs to compare tags for the cache it has.
However, tag matching is also required from the processor side, right? So you have tag matching from the cache controller for the bus,
which is for cache coherence, and one from the processor, when the processor needs data. So, how do we handle this interference?
One way to reduce interference is to duplicate the tags. You make two copies: one for the processor, one for the bus.
The other way: you have the cache coherence protocol reside in L2, okay? L2 is less heavily used by the processor than L1, right? So interference will be smaller, but you need to have L1 and L2 inclusive, meaning all the data in L1 must also appear in L2, okay?
So, you can think about it. For now, just remember there can be complications due to this interference, but let's assume there are duplicated copies, so that it won't interfere with CPU performance.
So when you have duplicated tags, you need to provide consistency — coherence — between them, even, right?
So when you…
What do I mean? When you snoop, any cache block address you capture, you check with your tag, right?
But during that tag matching, the CPU may have a load or a store. They also need to access the cache for tag matching, right? So you can't have only one tag array, because two sides are competing: one is the cache coherence controller, for the cache coherence protocol; the other is the CPU. The CPU should have higher priority, so we can keep
two copies of the tags, right? Then you always have to make sure, whenever one side updates, you update the other. That's a difficult problem, even, okay?
The other option: you put this protocol in Level 2, not Level 1. In Level 1, every time a load or store happens, the CPU will check tag matching, right? But if it is Level 2,
depending on the hit rate — say a 95% hit rate — only 5% of loads and stores come down to L2, so the interference would be small, okay?
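Rough arithmetic behind the L2 argument, assuming the 95% hit rate mentioned: only the L1 misses ever reach L2, so snoop tag checks placed there compete with a small fraction of the processor's traffic.

```python
# If 95% of loads/stores hit in L1, only the remaining 5% reach L2,
# so coherence tag checks at L2 see far less CPU interference.
l1_hit_rate = 0.95
accesses = 1_000_000                       # processor loads/stores
l2_accesses = round(accesses * (1 - l1_hit_rate))
print(l2_accesses)  # 50000 of 1000000 reach L2
```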
When we say we are duplicating the tags, it means we are duplicating the hardware unit, which… Yes, yeah.
The tags are nothing but — you know, they can be implemented with a CAM, Content Addressable Memory. So it is a hardware structure, right? You may have two redundant copies of the tag-matching unit.
Okay? One for the CPU side, and one for the cache controller.
Let's look at an example: the write-back snoopy protocol.
The basic policy for multiple copies is invalidation.
This is what you have with a write-back policy on a write.
What it does is snoop every address on the bus. If it has a dirty copy of the requested block — which means you are the only one who has the most up-to-date copy — you need to provide the block in response to the read request, and then that read request won't go to the memory.
So, there are states for each block in memory and in the cache, and they actually correspond to each other. Let's talk about the cache block states first, okay?
Here they are.
Each cache block will be in one of three states.
First, if you don't have a copy: invalid. It's the same as in the cache architecture.
Having a copy means, yes, you have a copy — and there are two different cases.
You have a copy, but the copy is clean, okay? We call that shared. This block is shared. So let me underline what I'm saying here: this is shared, okay? So: when you don't have a copy, it's invalid; if you have a copy, but it is only for read, it's shared.
If you have a copy and you are the one who updated it last, it's exclusive. You are the owner, okay?
Then, if you see any read request on the bus, you need to supply the data.
Corresponding to these states, a memory block will also be in one of 3 states, okay?
For a memory block: if none of the caches has this block, it will have, you know, an uncached label.
If there are cached copies out there only for read, it will be shared.
And if the block is out there in a processor's cache,
and it has been updated by the processor, it will be exclusive, okay?
You may have a question: how does memory know it is exclusive, right? When the processor writes, somehow memory should know a write is going on, so that it can set the state as exclusive. So let's look at that.
With this protocol, whenever you have a read miss, all caches snoop the bus, okay?
Then they will take the corresponding action.
Writes to clean blocks are treated as misses. Okay, here is the hint:
you need to make it a miss so that memory knows a write is going on. Do you know what I mean? Another processor has its own copy, but you are trying to write. There should be a way for the others to be aware of this write, so we treat it as a miss. You put the write on the bus, so that memory knows that action is going on.
So, okay, I memorize it this way. If you understood write-through: with write-through, there are only two states — whether you have the data or not.
Here, with write-back, you can think: if you don't have the data, I, invalid.
If you have the data, it's either S or M. M is exclusive; in the figure it's M — in your textbook, M, modified, okay?
S means shared: you have a copy only for read.
You are free to read, because it's a hit, right? You have a hit, but it's only for read: S.
M, or exclusive: you have a copy for write. You can interpret it that way. So how about this case? I'm trying to write, and when I check my cache, it is S.
The slide says you should treat it as a miss, because you have a copy for read — you don't have a copy for write.
But earlier, in the uniprocessor, that was a hit, right? You have a hit, you update your own copy, right? Here, however, we treat it as a miss: you update your own copy, but you put that request on the bus, so that memory knows
you are the one.
And what if, in S, there are multiple sharers?
I have S, he has S, she has S, right?
The other two also have copies, but I'm about to update.
What do I need to do? I need to put a write miss on the bus so that they see it.
Then they snoop, they look at that: oh, let me check — I have
a hit for read, but someone else is trying to write. What do you need to do?
Invalidate. Can you see that?
You listen: someone is trying to write, and I have a copy for read, S,
which means that this S copy is no longer the
correct one, right? You invalidate it. Can you see that? That's all you need to
know.
There are more things coming, but this is the most critical part: when you have an S copy and someone else tries to write, you invalidate.
How about when you have an M copy — an exclusive copy — and someone else tries to write?
What do you need to do?
Okay?
Can we support multiple exclusives, multiple modifieds? No — there is only one up-to-date copy, isn't there? Which means you need to…
Yes, very good point. You need to invalidate, but before invalidating, what do you need to do? Write back to the memory, because you are the one who has the most up-to-date copy, not memory. So you write back, like when a replacement happens. You write back, so that the other one trying to write
will get the up-to-date copy from the memory, and then he updates. Why do we need to do that? If a block holds only one variable, do we need to do that?
Okay, here's the situation.
What's your name? Toga? Tolga? Okay — Tolga tries to write to one block, block
zero, okay?
He doesn't have it, so he puts the write miss on the bus, and I listen, I check: oh, I have block 0 here. I have a copy.
One scenario: I just invalidate. I change to I. That's it.
Because he will update it, right?
Why do I need to write back my copy to the memory
so that he gets the updated copy, and then he writes?
The point is: if there is only one variable in the block, you don't have to.
Because even if I put S equal 5, he will put S equal 7.
The previous 5 doesn't matter, right? He just overwrites it.
So why do I need to write back to the memory before he gets it and updates?
The answer is in the hint: there is more than one variable in a block, right? So, okay, here
I changed S to 5.
The other variables, say T, are 0. Okay, so S is 5.
Then he wants to update the T value in the block.
If I don't write back — I just invalidate — then he will update only T, which means the S value I updated is lost. Can you see that?
In a block, there are multiple variables, so you always need to get the most up-to-date copy from memory.
In order to do that, whoever had the dirty copy needs to write back first, so memory can supply the up-to-date copy.
Okay.
That's the protocol.
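The lost-update hazard just described can be shown with a toy model (my own illustration, not from the slides): one block holding two variables, S and T.

```python
# Why a dirty block must be written back before invalidation:
# the block holds several variables, and skipping the write-back
# loses the updates made to the other variables.

# One cache block holding two variables, S and T.
memory_block = {"S": 0, "T": 0}

# P0 has a dirty copy: it set S = 5 but has not written back yet.
p0_dirty_copy = {"S": 5, "T": 0}

# P1 wants to write T. Correct protocol: P0 writes back first,
# P1 fetches the up-to-date block, then updates T.
memory_block.update(p0_dirty_copy)        # P0's write-back
p1_copy = dict(memory_block)              # P1's fetch from memory
p1_copy["T"] = 7                          # P1's write
print(p1_copy)  # {'S': 5, 'T': 7} -- both updates survive

# Without the write-back, P1 would fetch the stale block and
# P0's S = 5 would be lost: {'S': 0, 'T': 7}.
```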
All right, so time is up. What can we do, right? This will be on the final.
We will do the drawing together, okay?
You can pre-study for the final.
This is the point where we stopped. From here, you draw it yourself, okay? And then we have quiz 31. Do it by yourself.
Can you promise? By Wednesday?
Wednesday, okay? Then we will do it one more time together.
If you do it by yourself, you will have tons of questions.
Tons of questions — and then solve those questions with me on Wednesday, and you will ace this question on the final, okay?
Bye.
Okay, thank you.
Dec 3:
Yeah, it's just like… Like, there's two ways of representing it.
Why is it… what? Why? Why does… why does this have a 199 instead of a 99? The one
that doesn't… anything? It probably points to… It points to something that's empty.
Why isn't it empty itself?
Yeah, but then the 6 is the indirect block? Right, so that points… the indirect
block, and it points to block 1.
Oh, oh, Bachman switched to 32 and 3.
With this set of slides.
And then that's it, that's all the… there's only… The midterm has, like.
I had to, like, split it between what's the data chain and what's the interior
chain, because otherwise… Yeah, and fifth number. Yeah, we have, like… one, like,
thing at the same university.
It's supposed to be either called Exam 2. I spent all day, like, going over the old
videos, and then I'm like, oh wait, I still have a third project to work on in this
class.
I don't know if I'm behind. I don't have time to focus on.
Actually, yes, I'll be created here.
I got the invite, I don't know.
fully firing.
Nice to meet you.
I'm glad to see you!
Yeah, but, like… That system.
But I have never… Okay, let's begin!
Good afternoon!
So we have fewer people toward the end.
And I think I'm recording, so…
Any questions on the material we discussed last time?
You're okay? Where are we?
We are working on
the snoopy cache coherence
protocol. So why do we need to have a protocol? What is coherence?
Keeping the same data between different caches. Different caches? So, you need to give the big picture first, right?
So where are we? Which chapter?
The title of the chapter?
Multicore, right?
Multiprocessor systems.
So, when you have multiprocessor systems,
you communicate, you collaborate through global
variables, right? Shared memory, okay? Then, like you said, you should start with the big picture first.
And then you have caches — distributed, right?
Each core, each processor, they have their own caches. Then what happens?
A lot of headaches:
one processor makes a change, and it has to propagate to every other cache.
So, so you have…
A multiprocessor is so complicated, so in this class we will have only 2 or 3 processors, right?
Two or 3 processors, and then…
Collaborating through variables means we are writing and reading the same variables, right?
The problem of caches? You have your own cache, I have my own cache — what happens?
You have copies, right? Copies means redundant copies.
Right? So then, what happens?
One of the processors writes, modifies it.
Do we all see it?
You're talking about solutions. What's the problem here?
Why do we need a protocol? The data is not consistent.
Data is not consistent — right? One processor updates the value; the others should see the change, right? That's the main problem you're tackling.
That's the main problem we're settling.
And so, in that category, we learned the simplest protocol first. What was it?
So, is this coherence protocol related to reads or to writes?
Which causes the problem?
Writes, right? Writes update the contents.
Okay.
So, we need to classify the protocols we're going to learn, right?
How do we form the biggest categories?
What's the criterion?
It's about writes — and that you learned in cache design, right?
In terms of writes, we learned two different policies:
write-through and write-back, right?
So, last class, we learned the write-through-based one.
Right? So, why is it simple?
Every write is broadcast on the bus, so nothing needs to be resolved between different caches — it's only the memory involved.
So, write-through. Can you give me the definition of write-through? What is the write-through policy?
Whenever you update — right, you need to give the condition first, and then the action. Whenever you have an update,
you write through, meaning you update both the cache and memory. In other words, memory always has the up-to-date copy.
That's why the system becomes very simple.
Right? I write something, but I write to the memory, too. So whenever Ana, another processor, needs that value, what does she do? She can go to memory, right?
And because I always go through memory, that transaction can be observed by the other processors, always, because it's outside — it's not only in my local CPU, my local cache. Can you see that?
Why does this become simple? Because this way, we have a common broadcast medium, whatever I'm telling you.
There is no privacy, right? So everybody can snoop. So, let's define the protocol we learned last time. What was it?
We already discussed it: it is write-through, and snoop-based, okay?
Then you'll be fine. That's the simplest thing you learned.
Okay.
So, in terms of the number of states, we have only two different states, right? Valid and invalid. And for each state, you consider four different actions — four different things that can happen in a multiprocessor system:
your own read, your own write,
or you see, on the bus, someone else's read, someone else's write. What are you going to do? And then you can draw it, you can complete the finite state machine, right?
I'm not joking — I can put that question on the final, right? You should be able to draw it.
So the goal of today's class:
the MSI protocol, okay? A write-back, snoop-based protocol.
We learned write-through. Okay, what if we have write-back?
Because the write-through problem is: whenever you update, you need to use the bus, you need to go to memory — it's slow, it's not efficient, okay? A lot of systems have a write-back policy: only when the block is kicked out do you update the memory contents, which means
the most up-to-date copy can be
in other processors, okay? So you need to collaborate
to maintain coherence.
Okay, so let's look at it. Get out an empty sheet of paper, okay? Draw it together.
Every single line, you need to understand why we have this transition, okay?
And of course, without understanding, you cannot memorize it, okay? And you do need to memorize it,
because I will give you quiz 31, you see. I won't draw it — I won't give you the finite state machine. I'll just say: it's the MSI protocol, and this is the initial cache state; what happens with these operations, okay?
So you need to know it.
Memory knows that action is going on.
So let's look at the finite state machine for the snoopy cache coherence protocol with write-back policy.
Okay, I will give you some time. Draw three circles, three states. Okay. Leftmost is I,
invalid; then shared; and exclusive — in other words, M. So, oftentimes this is called the MSI protocol. You see: MSI protocol.
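The three states just named can be put in the same table form as the write-through protocol. Here is a sketch assuming the transitions described at the end of the last lecture; the event names (BusRdMiss/BusWrMiss) and data-movement labels are mine, and going from M to S on a remote read miss follows the textbook MSI convention:

```python
# MSI (write-back, invalidate) cache-side state machine sketch.
# A real controller also moves data; here only states and bus
# actions are modeled.

# (state, event) -> (next state, action to take, if any)
TRANSITIONS = {
    ("I", "PrRd"):      ("S", "BusRdMiss"),   # read miss: fetch, shared
    ("I", "PrWr"):      ("M", "BusWrMiss"),   # write miss: fetch, own it
    ("I", "BusRdMiss"): ("I", None),          # not my block: ignore
    ("I", "BusWrMiss"): ("I", None),
    ("S", "PrRd"):      ("S", None),          # read hit
    ("S", "PrWr"):      ("M", "BusWrMiss"),   # write to clean block = miss
    ("S", "BusRdMiss"): ("S", None),          # memory can supply
    ("S", "BusWrMiss"): ("I", None),          # someone else writes: invalidate
    ("M", "PrRd"):      ("M", None),
    ("M", "PrWr"):      ("M", None),          # write hit in owned block
    ("M", "BusRdMiss"): ("S", "SupplyBlock"), # supply dirty data, demote
    ("M", "BusWrMiss"): ("I", "WriteBack"),   # write back, then invalidate
}

def step(state, event):
    return TRANSITIONS[(state, event)]
```

The two critical arrows from the last lecture are here: S on a snooped write miss goes to I, and M on a snooped write miss writes back before invalidating.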
Yay?
So…
I keep saying this is a finite state machine, right? A finite state machine. Have you ever designed a controller? There are a lot of ECE students and CSE students here. CSE students, I believe in your digital logic class — I don't know your curriculum, but you should
have experience designing sequential logic.
Have you built sequential logic?
Given a finite state machine, you could
implement the controller?
You got it? You did? And then, ECE students, you have a controller class
itself, right?
A separate controller class.
So… let me put it up fully.
A quick overview. It's all about hardware.
Oh, I didn't share?
Okay, okay.
So… The, the one we draw before, let me simplify BI.
And then, let's say it's, just the cloud generation finance state machine.
So you have a V, and then whenever clock comes, you go here, and then whenever you
clock down, you go there. Another clock, like that. You oscillate the VIVI, okay?
But I don't want to have a VI. I can change. Okay, so input is X,
Okay, so when input equals 1, it goes I, and then if it's 0, it stays here. If it
is 1, it stays here. Whenever we change to 1… I can, okay, here, like this. Okay,
let's say I draw this finite state dimension.
Can you build the controller logic?
So how do you do that?
We start with… what is it? The truth table.
This is your input.
And then, let me give you the output, Y. Y is associated with the
state. Here, when the state equals 1, the output equals 0. Okay.
And then your input… this actually gets implicitly ANDed with the clock, right? Do
you remember? Synchronized. This is a…
Sequential logic means it operates with a clock. Only when the clock comes do you
read the input.
Okay, combinational logic: regardless of the clock, your inputs keep feeding into your
system, your AND or OR gates, right? Anytime your input changes from 0 to 1, or 1 to 0,
your AND/OR gate output will change.
However, sequential logic only operates with the clock. You read the input only when
the clock edge comes.
So implicitly, every time the clock comes and you read the input, there is an
ANDing with the clock, okay? But
because this is always there, you always AND with the clock, is there any information
in it? Always means the probability equals 1.
Okay, the WH students, you learned Shannon's information theory, right? In a
communication class.
And even with my 312 students, I talk about Shannon's information theory.
When the probability equals 1, do you have any information there?
Probability equal to 1 means it always happens.
It always happens. I joke about this with my students.
So, next Monday,
the TA runs to the class,
two minutes before class,
and she says, oh, EJ is coming.
That's number one.
The other case: she runs to the class and says, oh, EJ is not coming for today's
class, she's sick.
Which one has more information? And you should explain it with probability.
Number one or number two? I know you're happy with number two. Maybe you're
upset, because it's just the…
the last class before the final, right, you want to have a class.
But both have one bit of information.
Either she's coming or not. Yeah, yeah, yeah.
Okay, so if you make a newspaper, which event should be in the paper?
Two. Two. That's why: because it has more information. People will be curious about
it, right? So how do we quantify information? With probability. The probability
that EJ is coming in two minutes.
It's most likely, right? I never miss class without any announcement. The probability
she's coming is 1, okay? So why would she run in and tell us this?
Okay? There is no information, because it always happens. Oh, tomorrow the sun will
come up in the morning. It's no news, okay?
If an event occurs very rarely, the probability is very low, and it carries more
information. So, in Shannon's information theory, if you look at the entropy,
this is the formula they use: the information is the log of the reciprocal of p,
so it grows as p shrinks.
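The anecdote can be put in one line of math: the self-information of an event with probability p is log2(1/p) bits. A minimal sketch, with the probabilities below purely illustrative (they are not from the lecture):

```python
import math

def self_information(p):
    """Shannon self-information in bits: I(p) = log2(1/p)."""
    return math.log2(1.0 / p)

# Illustrative probabilities (assumed): the professor shows up with
# p = 0.99, and misses class with p = 0.01.
print(self_information(0.99))  # ~0.0145 bits: almost no news
print(self_information(0.01))  # ~6.64 bits: rare event, big news
print(self_information(1.0))   # 0.0 bits: a certain event carries no information
```

The p = 1 case is exactly the clock-signal point above: an always-true condition contributes zero information, so you omit it.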
Okay, this is what I like about it. Okay, here:
the clock signal is always there, always has to be ANDed in. It is not information, so
you don't write it, okay? Do you know what I mean? When you write your term
project,
if it's something we know for sure, probability equal to 1, you shouldn't
explain it, do you know what I mean? It's redundant.
Okay? So, only one place. So, this is it, okay? Let me briefly talk about how to
build it based on this, because you guys need the big picture. You have a
finite state machine. Remember, finite state machine.
The cache: you have the tag, and then the state bits, okay, here. Three states, so how
many bits are you going to use?
Two, right? So let's say the states are 00, 01, 10. We include the state bits.
Okay, so this is a 0 and 1.
Can you see that? We include the state. And then, do you know truth tables?
You can draw a truth table, right?
So someone smiled. Do you recall how to build this controller from a given finite state
machine? If you don't know, you should review, okay? We are hardware designers. Oh,
no, you're not. You are just taking the class, right? But when you interview,
people expect you to know these things, right?
So what do we do?
Okay, you have an input, right? X is the input, and you need to determine Y, the output.
That goes in a truth table, right?
In sequential logic, you also go based on the current state, right? So your state name,
your S, is there, okay? S can be 0 or 1, okay? And then this is the current
S, and then the next state… You can build this from that, right?
Right? That's it. That's all you need. Then with the truth table, you can build the
combinational logic, right?
Okay, so you have two parts: the combinational logic part, and then here, you have only
one D flip-flop.
A D flip-flop, right?
The flip-flop, one… so that is S.
And the input is X. And it will determine Y
and the next state. Can you see that? This is your controller design.
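The controller just described can be sketched in software instead of gates: one state bit S held by the D flip-flop, an input X sampled on each clock, a next-state function and a Moore output function read off the truth table. The exact transitions below are my reading of the toy V/I machine on the board (V stays on 0, goes to I on 1, and I falls back to V on 0), so treat them as illustrative:

```python
# One-bit Moore machine: state S (0 = V, 1 = I), input X sampled each clock.
# Assumed truth table (illustrative):
#   S=0, X=0 -> next S=0      S=1, X=0 -> next S=0
#   S=0, X=1 -> next S=1      S=1, X=1 -> next S=1
# i.e. next_state = X.  Output Y depends only on the state: Y = 0 when S = 1.

def next_state(s, x):
    return x  # combinational next-state logic from the truth table

def output(s):
    return 0 if s == 1 else 1  # Moore output associated with the state

def run(inputs, s=0):
    """Simulate the D flip-flop: each 'clock', emit Y, then latch the next state."""
    trace = []
    for x in inputs:
        trace.append(output(s))
        s = next_state(s, x)   # value captured by the flip-flop at the clock edge
    return trace, s

trace, final = run([1, 1, 0, 1])
print(trace, final)  # [1, 0, 0, 1] 1
```

The two parts of the drawing map directly onto the code: `next_state` and `output` are the combinational block, and the loop variable `s` is the single flip-flop.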
So, here is what I want to show quickly. In your cache, for every tag, you have state
bits, right? 0, okay?
Oh, the current one is 01. And based on this transition, it can go to 1, 0,
or it can go to 0, 0,
following the finite state machine.
And how do you implement a finite state machine in hardware? This.
We have a circuit, okay?
So, okay, this is the whole picture. I don't need to go into the detail of the transistor
level, right? So in this class, if you took my course, I believe you can build this
computer from AND/OR gates, right?
You should have that information here.
Okay, any question on this? Let's go to the finite state machine. So it's a finite
state machine, big deal. In hardware design, every time we need a controller, we
draw a finite state machine, and someone comes up with this logic, okay?
And then each bit changes; look at all of this. In the drawing, in your final exam, it's
only two bits, but behind them, there is this finite state machine. Hardware, there,
okay?
Okay, let me clear it out.
Like we did for write-through, we have four different requests.
One, CPU read; the other, CPU write. These are from your own CPU,
as in the write-through example.
So, do you remember we talked about the four different operations that can happen, right?
My own read/write, or the bus read/write, right?
Because these three states are complicated, we will handle only my own read/write
first.
Draw that, and then we will add the bus read/write later, okay? So, only think
about yourself. You're one of the multiprocessors, and then what happens if you want to
read, or if you want to write, okay?
They were noted as processor read and processor write. It's the same as this one:
CPU read, CPU write. In this slide, we only look at these two requests, whereas the
other two, the bus requests, of which there are two kinds, bus read and bus write, will
be handled in separate slides. Let's look at each state.
Let's say you are in invalid, okay, for that cache block. Then,
when you are trying to read it, you're going to have a miss, right? Then you're going to
bring it from memory, and the state will change to shared.
In shared, if there is a CPU read, you know it will be a hit, right?
So let's go back to the invalid state to finish. There are only two requests. CPU
read: we've already seen what happens. How about CPU write? Of course, you will
have a write miss, right?
What do you do? So, it can be overwhelming if you try to come up with a complete
solution. First thing:
think about one transition. So, you are in invalid, right? Which means you are
trying to write: you have a store instruction.
When your PC goes to a certain memory position, and you look at the opcode,
it was a store. Store means you are writing, right? So then, with the memory
address you calculate, you check. You go to the tag, right, and the index bits, and
then you see the
state bit. And it is invalid. Invalid means I don't have it.
So… where are you going to go?
It's a miss. It's write-back, so you go to memory, if you're the only one, right, you
go to memory, and you will bring the block and put it in your cache, right? So what will
be the status?
The state bit.
Is it shared, or…
E or M here?
Okay, it wasn't a load. Load means you are reading, right? So,
it is a store: you are writing.
So you brought a block in order to write.
At the end of this instruction, you update, right? You're the one who has the modified
copy.
So where would you be? In shared, or…
exclusive, or modified?
Write miss, and you're going to bring the block, but then you are about to write,
right? With write-back, it will be recorded as exclusive. Okay, because you are holding
the cache block, and you're updating it, so this will be the most up-to-date copy. So,
for that cache block, you are the owner. It will be exclusive. Okay, we are done.
Okay, so look at your drawing for the write-through snoopy protocol.
What do we have? We have PR read, PR write, and then a slash, right?
So, someone asked last time, what is the slash? After the slash, you put the bus
transaction, an action, right? So this is it. So, when you have a CPU write,
you need to put a write miss on the bus. Can you see that? This action?
And then you will bring the block from memory.
For now, don't worry about how others will
react to this write miss broadcast, okay? For me, I just put the write miss
broadcast because I need to get the block from memory, and I will bring it, and then my
state will be E, or S, right? So…
Okay, very quick.
So we are done with the invalid state.
Yay.
Done with invalid.
Then next, let's finish…
Okay, so let's say we are in shared. So, shared means your PC's
current instruction was a load.
Load means you are reading memory, right? And you calculate the address, you go to the
cache tag and index bits, and then you see the state, and it was shared.
Okay, so if it is a load, it's a hit, right?
Is there any change in terms of state?
No. Okay. How about CPU write? You are writing.
But you have an S copy, a copy in S.
So?
You can modify, right? But you should… you should put a
write miss on the bus so that others can see it.
In S, there is a danger. I could be the only sharer. Okay, in this protocol, we don't
differentiate. In the next slide set, we talk about optimizations. There, we will
have an E or O state, you know, to
improve the efficiency of the protocol, but here, we have just S. I can be
the only sharer, or there are other sharers.
In case there are other sharers, what should happen?
Like, Anna, you have… so, I have a copy, you have the same copy, but I'm
about to update. What should happen to yours?
Invalid, right? So, if I just directly update mine, because it's write-back,
I can just change mine, right, without letting her know, then it's…
breaking coherence, right? So I need to treat it as a miss.
Oh, I have a copy for read only, but I need to write, so I have a miss, so I put that
request on the bus, okay?
Then Anna can hear it. So, in terms of state change, it will be changed from S to E.
That is the shared state. Shared means, when you look at CPU read, yes, it's a hit, the
state is the same. But how about CPU write? If you try to write on shared data,
the thing is, there may be other sharers, meaning other processors may have the copy,
but they are also clean. Shared means it is clean; it has not been updated. But what do
you need to do before writing?
You will
place a write miss on the bus, even though you have the copy for read, because you don't
have access to write, right? Shared means read only. So you
need to put a write miss, and then you go ahead and write.
Which means the other sharers, other processors with a shared copy, when they see the
write miss on the bus, what do they need to do?
It's an invalidate-based protocol, right? They need to invalidate their copy,
okay? We will talk about that later. So here.
In exclusive state.
Okay, you have a question? On an actual write miss, the memory sends back the
data. In this case, will the memory still send back the data? Do you need to get the
data from memory? No, you have a clean copy, so you can go ahead and update.
So here a consistency problem comes in. When you really update, there is some time
gap:
others still didn't get the invalidate, or didn't have enough time to invalidate, but
you update. That can happen.
But in coherence, we don't care, okay? Since I put the write miss on the bus, I can
modify, okay?
Alright?
Okay, so let's talk about…
now we have the two cases for each of invalid and shared, so now we are in E.
Actually, it's M, okay, and on our side, we call it the MSI protocol. The M stands
for modified: the MSI protocol. You're just learning the MSI protocol.
So in the M state, the exclusive state, what happens when you have a load instruction?
You look at your cache index, right? You find the cache index, you go there, and the
state bit says modified, and you try to read. You can read, right? You
don't have any other action you should take, right?
Okay, how about a write?
Same thing, right?
Yeah.
Nothing, because you have the M copy.
Do you need to let the other processors know? No, right?
If you have a read
or a write, it will be a hit, right? You don't have to change the state. It will stay
as exclusive. Exclusive means you can read and you can write.
The other case: you actually are in exclusive, but you are having a miss, okay? You
tried to write, and you are having a miss. Why? Do you remember?
I think this one we just tried to explain.
Actually, I don't feel the necessity of including it here in terms of
state change, because this case is… When you have a
store and calculate the address, you go to the cache with the index. Okay, index.
When you check, go to the index, this is just like before. We assumed the tag matched,
right? This is mine, and then hit, right? This is mine, it's M, yes, I'm trying to
read or write, it's a hit, right? But…
what if I go to the correct position with the cache index, but the tag does not match,
and that state was M, modified? Okay, what happens?
Okay, so let me give a
more detailed example.
So, do you remember, with the cache
address, we have three fields, right? So…
when you check the cache, the stored tag
was A, and let's say the state bits were 11, okay? But you have this
store. The address you calculated had tag B, the same index, and then the offset, okay?
So, let's say the index is 110, something like that, okay?
So, since the index is 110, you go here, okay, and then you do tag matching.
This is… different.
Okay, then tell me: if this were a single processor, what do you do?
You go to memory, you bring the new block, right?
A replacement happens.
Okay? So next time, after this instruction, the tag will be changed to B. Changed to B
because a replacement happened, right?
How about the
state bits? You brought this block for a store, so it would still be 11, wouldn't it?
Actually, when you went there, it was 11, but it actually was a
miss for my block, wasn't it? The block I'm looking for is not there. But there
happened to be another block sharing the same position, and its
state bits were 11, okay? So it looks like that, but we need to
take one more action. What happened? 11 means what?
This A block…
it has been changed since it was brought in, hasn't it? And this is the only copy you
have in the world.
So can you just write B in and replace it? No. Write-back means you need to write
this A back to the memory first, right? So this is all about that, okay? It's the
replacement
case.
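The replacement case just walked through can be sketched as a few lines of Python. The field names are hypothetical, just to show the order of operations: on an index hit with a tag mismatch while the line is dirty (M), the old block is written back before the new one is installed.

```python
# Hypothetical single-processor cache line; names are illustrative,
# not from a real design.
class Line:
    def __init__(self, tag=None, state="I", data=None):
        self.tag, self.state, self.data = tag, state, data

def access_for_store(line, tag, memory):
    """Handle a store that indexes this line (write-allocate, write-back)."""
    if line.state != "I" and line.tag == tag:
        line.state = "M"               # write hit: just mark modified
        return "hit"
    if line.state == "M":              # tag mismatch on a dirty line:
        memory[line.tag] = line.data   # write back block A before eviction
    line.tag = tag                     # bring in block B from memory
    line.data = memory.get(tag)
    line.state = "M"                   # we brought it in order to write
    return "miss"

memory = {"A": "old", "B": "fresh"}
line = Line(tag="A", state="M", data="dirty-A")
print(access_for_store(line, "B", memory))  # miss
print(memory["A"])                          # dirty-A was written back
```

The key ordering is the write-back happening before the tag is overwritten; skip it and block A's only up-to-date copy in the world is lost.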
Remember the directly mapped cache? In a directly mapped cache, other blocks with the
same index conflict: they share the same spot, right? So here, a write
miss happens: although your state is E, the tag match fails.
The block you're holding is a different one, so you need to bring the new one, right?
So you will place a write miss on the bus, and then you need to write back
the evicted one to
memory. And then, once you bring the new block, its state will be exclusive.
So here, let's look at what happens on bus requests.
Okay, so you can
use the one you already drew. You drew it, right? Do you need more time? Did you
finish?
You should. Every arrow, you need to understand, and then on top of that, you can
add these additional arrows for the bus, okay?
But in this slide, I separate them,
so that you can clearly see, okay? Here, now forget about what happens from your
own requests, because we are done with those, okay? So now, you listen to the bus, and
you hear something, and then when you snoop, this is the
state you have in your local cache, okay? Then what are you going to do?
There are two different bus broadcasts: the read-miss broadcast and the write-miss
broadcast.
In other words, when other processors have a read miss or a write miss, they
will put the broadcast on the bus, and the other processors snoop,
listen to that, right? So what are you going to do on that bus request? Let's talk
about it from the invalid state. So you keep snooping, and let's say you see a
read miss on the bus. What are you going to do?
Nothing, because you don't have the copy. It's invalid, right? So you look at the tag,
it matches, but it's invalid. You look at the cache block position; it's
invalid. You don't have a copy, so do nothing. Same thing with a write miss.
Do you agree?
You hear something, you catch it, you do the tag match, and then you go to the
cache index, right? Cache index, and then you check the state bit, and it was 0, zero.
Do you need to do anything? Nothing, because you don't have it, okay? So you will
go back to that state.
When you look and see, oh, invalid: nothing. However, let's move to the exclusive
state. Exclusive state, when you see a read miss…
which means you're holding the most up-to-date…
The answer is there.
So, you hear it: someone, Anna, tried to read. Then, when I check mine, how do
I check? With the cache index. This is the exercise we're going to do: with the memory
address, you figure out what the cache index is. I go there, and then I check, and it
was 11, meaning exclusive. I have it exclusive.
She tried to read.
What
should a good citizen do?
You'll need to write back
to the memory. And then…
memory can give it to her. How about my state? Am I going to invalidate, or share,
or stay M?
The next state, right?
It can be exclusive?
It can't be exclusive, right? No, no, she tried to read. Okay, so I have the M copy,
the most up-to-date copy.
And then…
she wants to read, so I want to give the data, and I write it back to the memory.
So she will definitely be in the S state, right? How about my state?
So if I had kept M, what would happen on her read?
So, let's go back to the earlier slide. When I have M,
and I try to update, what happens?
In M, you don't do anything, you just update, right? But she has the copy in S.
So she would read the old version. Can you see that?
So, no, there is no way you can have the coexistence of S and M for the same block.
Can you see that?
I have M, she tried to read, I write back, and I need to change to I or S.
Which one? Both of them would be fine, but it won't hurt to keep it as S, because if I
read, it's okay, right? It is still…
the right copy.
Because she only has a copy for read, right? There is no case where I read the wrong
value, is there?
Okay? So, when…
I am in M and I see a read request on the bus, I change to S, okay? All right?
How about: I'm in M, and then Anna tries to write?
So she has a miss, and she puts that request on the bus, and I hear it. What do I
need to do?
Invalidate.
But what else do I need to do? I need to write back, because I'm the one with the most
up-to-date copy, right? I write back to the memory, and the memory will give her the
up-to-date copy, and then she will change. She will have M; I will have
I,
because I don't have a copy anymore. She's the only one who has the most up-to-date
copy. Clear?
Okay.
You have the up-to-date copy, right? Updated by you, and someone else tried to read it.
Then what do you need to do?
You need to abort that request, because memory doesn't have the up-to-date copy, and
you need to write back. You need to write back that block, and you need to change
your copy's status from exclusive to
shared.
Okay? So this block will be written back. You need to take action so that the other
processor
that put the read request will get it, and you will change your own state from
exclusive to shared.
What about a write miss? You're holding the exclusive copy, and you see another
processor try to write, right? Then you need to supply, same thing, you need to
supply the block and abort the memory access, because you know you have the most up-to-
date copy, not memory, right? So the other processor looking for that block should
get the up-to-date copy from you.
So, you will take that action, and then, after you supply the copy,
there can be only one exclusive owner, okay? So you need to invalidate yourself.
That's all we do with the bus requests from the exclusive state.
Okay, then, let's guess.
You hear… let's start with a bus read.
So I'm the one.
I see Anna try to read.
When I check, it's S.
She tried to read.
S means what?
Shared, so… my state.
Is there any change?
Okay, so I will stay there, right?
But then, complete the scenario:
what happens in the system? Anna put the request, the read miss, I don't do
anything, and what happens? That read miss request goes to…
memory, and memory will supply the copy.
Okay?
And then someone can say, oh, you have an S copy, the most up-to-date copy, right?
The correct one.
I could give it, right? That is the
optimized version; that will be done later, okay? But at your level, for
just learning the protocol, let's not assume any optimizations here, okay?
So memory will be responsible for supplying the value, okay? But I hear, oh, she
tried to read, and I have S. S can be multiple, so I don't do anything. I stay there.
Okay.
So what about when she tries to write?
Because she has either an S copy or an I, she will put that request, too,
on the bus, right?
Think about her write request, in each case. In invalid, she will put a write miss
request on the bus. In S, she will put
a write miss on the bus, correct or not? Right. How about M?
She won't put a write request on the bus, right? Can you see here?
So I can guess from there: she is either in I or S, and I am in S.
Okay, so she tried to write. What do I need to do?
Invalidate, okay? That's it.
We are done.
How about shared?
In shared, when you see a write miss, it means you have a clean copy, but someone else
is about to update, right? Then this copy is no longer valid, right? So you need to
move to invalid. You don't need to take any other action on this request, but you
should change your state to invalid, because it's an invalidate-based protocol.
The reason you don't have to do anything is that memory has a clean copy. Memory
will supply, okay?
How about a read miss?
In terms of state change, you don't need to do anything, because memory, right?
Memory has the up-to-date copy, a clean copy, the same copy, so memory will take
care of it. In terms of the status in your processor, there is no change. That's all.
So here, we will see the finite state machine's state change with block
replacement. Remember, block replacement happens in your own CPU. So it
is the same slide we had two earlier, with the CPU requests, right? Actually…
The last one we discussed with this finite state machine, this one, I'm
not sure if you can see, let me point it out here. This one, if you remember: when you
are in exclusive,
how come you can have a write miss? I explained: when you look at tag matching,
it doesn't match. But this miss is for a write. Then you will bring the
new block; a replacement happens. But it was
exclusive, the old block was brought in for a write, and now you are bringing the new
block for a write, too. So in terms of state, it's still…
exclusive, but for a different block, okay? The tag values are different.
So we need to think about the other replacement cases.
From invalid, you don't have anything; a replacement never happens there, right? You
need to have something to replace. So let's say you have a copy, and then you need
a replacement.
What about CPU write? It's the same thing. Whether you are writing to
the same block, where the tag matches, or to the same cache index with a
different tag, which is a write miss, right? Either it's a write hit or a miss, it's the
same thing. You need to put a write miss on the bus, and then S will be changed to E, okay?
But what about a read miss?
When you are in shared, but when you go to the cache index, the tag is
different. So you need to place a read miss on the bus. When you have brought that
block in,
what will its status be?
Actually, it's the same as…
Can you follow?
I go to the index, and it was shared, but the tag is not matched, which means a
replacement will happen.
So you go to memory and you bring the block, right? So your tag will be replaced. How
about the state bit? It depends on why you bring it, right? If it's a read, a load,
then it will stay S, won't it?
Okay.
Yes, right? So this place was holding a block for read, and now a replacement
happened, but it is still for read, so it's shared. Same deal, right? But for a
different block.
Okay, we are done with shared. How about exclusive? We talked about the write miss.
How about the read miss? When you have a read miss, it means the tag values are
different, and you will…
What happens when you have a read miss from here,
and the tag is not matched? A replacement happens.
It will go back to?
Shared, right?
And then you need to write back the current block, right? Because you have the
up-to-date copy.
You bring a block for read, right? Of course, you will move to shared, because the
replacement block, the new block, is for read, so its state will be shared.
Okay?
So we are done.
So this is the CPU requests and the status of the cache, and the earlier slide, the
previous one, was on bus requests. So now, when we talk about the finite state machine,
we need to combine them.
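One way to check your combined drawing is to encode the machine as a lookup table. The tuples below are my reading of the transitions covered in lecture (CPU read/write plus snooped bus read-miss/write-miss, with replacement folded into the miss cases); "M" is the state the slides label exclusive. Treat it as a sketch to compare against your own diagram, not the authoritative protocol.

```python
# (state, event) -> (next_state, action).  Events: CPU read/write, hit or miss,
# and snooped bus read-miss / write-miss for a block we hold.
MSI = {
    # CPU-side
    ("I", "cpu_read_miss"):  ("S", "place read miss on bus"),
    ("I", "cpu_write_miss"): ("M", "place write miss on bus"),
    ("S", "cpu_read_hit"):   ("S", "none"),
    ("S", "cpu_write"):      ("M", "place write miss on bus"),  # even on a tag hit
    ("S", "cpu_read_miss"):  ("S", "place read miss on bus"),   # replacement
    ("M", "cpu_read_hit"):   ("M", "none"),
    ("M", "cpu_write_hit"):  ("M", "none"),
    ("M", "cpu_read_miss"):  ("S", "write back old block; place read miss"),
    ("M", "cpu_write_miss"): ("M", "write back old block; place write miss"),
    # Bus-side (snooped)
    ("I", "bus_read_miss"):  ("I", "none"),
    ("I", "bus_write_miss"): ("I", "none"),
    ("S", "bus_read_miss"):  ("S", "none (memory supplies the block)"),
    ("S", "bus_write_miss"): ("I", "invalidate"),
    ("M", "bus_read_miss"):  ("S", "write back; abort memory access"),
    ("M", "bus_write_miss"): ("I", "write back; abort memory access"),
}

# Replay one block's history: I write it, Anna reads it, then Anna writes it.
state = "I"
for ev in ["cpu_write_miss", "bus_read_miss", "bus_write_miss"]:
    state, action = MSI[(state, ev)]
    print(ev, "->", state, "|", action)
```

Walking a few event sequences through the table and through your drawn diagram is a quick way to find a missing or wrong arrow before the quiz.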
Do you have a color pen?
I will use colors to differentiate the CPU and the bus, okay? I will
give you two more minutes to draw it, because for the next quiz, you really need
to see this finite state machine.
Okay, I don't have space to put it somewhere, okay?
Draw it by yourself.
That's the only way you can learn.
Before the final exam, I would draw it
50 more times, so that I fully understand. You know how much a slight misunderstanding
can cost, right? You're experiencing it during the term.
It can cost you huge points.
This used to be the highlight of this class, the cache coherence protocol, because at
that time it was all about CMPs, chip multiprocessors. Now it's GPUs, right?
The trend has shifted a little bit.
Unfortunately, I don't have time to cover accelerator design, but…
You can learn from logic, digital logic, design.
Huh?
When you finish the drawing, I can be done.
I write in white, and then change colors.
The system I tried to have: my own CPU is purple,
the bus is a different color, and the block replacement is a different color.
The purple is the CPU.
And then the brown is replacement.
And then orange, I don't know.
Yeah.
Okay.
Let me go back.
Oh,
when I copy it, it doesn't keep the system, right?
Does it?
You can make your own system to memorize. A student asks: I don't remember, what was
the reason that a read miss in the exclusive state downgrades to shared? We already
know we have exclusive access to that block.
When I read it, what does it matter? If it's your own read, it's okay, but
someone else's read… yeah, very good question.
Yeah, for every arrow, you should have a question if you don't understand the
situation, because it appears in the exercises.
So one question.
So, let's say I load a memory block, and then I need to modify
something. So you have a store instruction, and then you have a hit, and the
state is S, or M, or I. Let's say it is S. I have a copy for read, and I am about to
write. Right, so…
all the other processors have to get the information that I'm going to write, so
they have to invalidate theirs, right? So we treat it as a write miss. You have a
copy for read only, not write. So, actually, it's the same as a write miss, so you
need to put a write miss on the bus.
Okay, but let's say the other processors don't even know whether they have the same
data or not. Why should I waste my time getting a response from them? There are two
possibilities. Maybe you are the only sharer? Yeah.
That's fine, right? You don't have to wait; you can just update. That's what you are
saying. But does the protocol have to wait for…
No, no, no. Just like I told you: I just put the write miss, and then I go
ahead and change it, okay? We don't synchronize. Okay.
Okay? There are cases where other sharers exist, so you should put the write miss on
the bus.
To make sure.
Yeah, there is the case where you're the only one, so we will talk about a protocol for
that case. We will create one more state to optimize performance.
Yeah, that would be better, right? In stronger
cases, you want to hear an acknowledgement, right? But if nobody is there,
nobody acknowledges, right? You don't know. So when we go to directory-based,
which I cover later, in the directory-based protocol you can see how many sharers
there are.
Okay? But on this bus, we just, okay, broadcast.
Okay, let's do Quiz 31.
Did you try?
Okay, the thing is, I have a lot of students who
think they know this, and they practice this question, and then the
homework question…
But then, the final is different. Why? I only change the address, right?
But if you only briefly understand it from the solution, it won't work. Okay, so you
really need to pay attention to this. Okay, this is too small. A student: because if
you have the exclusive copy, and everybody else doesn't have any of the tags, then it
will go in that spot.
Is there any way I can make it bigger?
Oh, download. Download. Then where can I… over here. This one. Okay, so I can make
it bigger, okay.
So the tricky part here, in the question:
for 118, you can see the address here, right? The address is here.
How about 109?
Okay.
How about 109? Or 101?
Okay?
Don't rely on the textbook solution.
I will show you how I would do it, and you should do it exactly the same way,
so it will work in any case.
I can change the assumptions, right? So if you do it the correct way, anytime I change
the assumptions, it will still work. But if you just briefly go over the examples with
simple assumptions,
then if I change the block size, if I change the addressing, it won't work.
I will combine this question with cache design. So, you always need to do what?
With a hexadecimal number, you need to change it to binary, and then, when I change the
cache size or block size, the number of cache index bits will change.
Okay?
All right, so we will go through that.
So, you read this question, right? I don't have to read it one more time. How many of you attempted it?
Okay, not all of you. So…
So, you can think here: is it a direct-mapped cache?
And how many cache index bits are you going to have with this figure?
Two, okay? So you have two bits, okay?
Alright? Two bits in the middle.
And then how many bits will be the offset?
One sec…
So, this is MSI.
And then… so where is the block size?
Where? At the top. Where? Four blocks, each holding two words.
Okay, so you have four blocks, which means two bits in the index, right? And each block holds two words. Each word is four bytes.
So, 8 bytes, right? So you will have 3 bits for the offset.
Okay? So, in this figure, you can imagine this is the first word and this is the second word, like that. Okay?
The tag: the figure shows the whole address as the tag, right? That's just for convenience, to put the address there. But when we really do tag matching, it's with this upper part only.
Will 101 work? You cannot find 101 here. Where is 101?
101 is… actually, this is 101, if you exclude the lower bits. Okay.
So you need to do exactly the same thing. For example, I will give you the first one, 118. This is hexadecimal, so you change it to binary: 0001 0001 1000.
Okay?
So, these 3 bits are the offset, and these are the index bits. Can you see that?
Index 11 means you should check row 11.
The figure won't work if I change the assumptions, okay? Don't rely on the figure. You need to do this step.
Alright?
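The decomposition steps above can be sketched in code. This is just an illustration, assuming the lecture's parameters (a direct-mapped cache with 4 blocks of 8 bytes each, i.e. two 4-byte words); the function name `decompose` is mine, not from the textbook.

```python
# Sketch of the hex-to-binary decomposition step, assuming the
# lecture's cache: direct-mapped, 4 blocks, 8 bytes per block.
# Offset = log2(8) = 3 bits, index = log2(4) = 2 bits.
def decompose(addr, num_blocks=4, block_bytes=8):
    offset_bits = block_bytes.bit_length() - 1   # 3
    index_bits = num_blocks.bit_length() - 1     # 2
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (num_blocks - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# 0x118 = 0b0001_0001_1000 -> tag 0b0001000, index 0b11, offset 0b000
tag, index, offset = decompose(0x118)
print(bin(tag), bin(index), bin(offset))  # prints: 0b1000 0b11 0b0
```

Note that 0x11C decomposes to the same tag and index with offset 0b100, which is exactly the "same block, second word" case discussed next.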
The last two columns: data.
Is that because we said it can store two words, right? Yes, exactly.
So let's say this is one word. Then how about offset 1100? That's 12, right? And 12 in hex means what? A, B, C: your address is 11C.
11C would go here, right?
But 11C is not shown…
Let's assume there should be another word. This is one word, and then there is another word, I think. 11C should be in the same block, and then the other word, okay?
118 through 11F is the same block; 3 bits for the block offset, right?
So it would be in the same block; the data would be there. It would be a hit. Yes.
I was just asking about the data column, the two columns. Yeah, let's assume two words: this is the first word, and this is the second word.
Okay.
But let me go through this simple example first, and then think about it, okay? Do you know what we are talking about?
So, you see only 118 here. If I change it to 11C, it will still be there. Like, let's say here: if P1 has to read 11C, you will go here, and it's a hit, because the rest of the tag matched.
Only the offset is different, right? So instead of reading 0018, you will read the next word, which is not shown here. Can you see that?
So, let me think about it. I changed the question.
11C: C is 1100, which is 12. 12 is C, right? A, B, C. So the full address is 0001 0001 1100.
If you look at the index bits, the fourth and fifth bits, the index is 11. You go here, right?
And then your offset is 100, not 000, so you are supposed to read the second word.
Okay? Or write to the second word.
Okay? This is why.
It wasn't clear to me: are these states per… if we have an associative cache, are we… No, this example is a direct-mapped cache. If we have an associative cache, does this scheme…
Do we also reserve it per tag and index together? Yes. This state bit is per block.
So let's say you have a two-way set-associative cache.
That means with the same index, you have two ways, right?
Which means you have two tags, and two state bits, one for each block.
So, in this case… it's direct-mapped. Can you have two different tags in the same index?
Any tag can appear at a given index, right? So is it possible for both to be exclusive in that case, or is it exclusive based on the index only? Okay, let's look at this example.
You can have a 11 here. As long as the index is 11, you go here, to the last row, right? So how many different blocks can map here? With 7 tag bits, 2^7 different blocks go to the same location.
So, when a block has exclusivity, does that mean for any possible tag at that index it's exclusive, or for this exact tag? The protocol…
If there are multiple caches, they each have the same copy of this cache structure, with whatever data in each one.
A particular block is in exclusive or non-exclusive. Does that mean that particular block with that tag is exclusive, and another cache could have the same index filled with a different tag?
Okay, so the maintenance of each local cache is the same as we learned for a single processor.
Cache coherence is between caches.
Okay, I still don't get what you mean by exclusive. Let's say 118: you have the copy.
And your tag is this one, right? Because the index is 11, in your cache it goes here.
Okay? I have the same layout: 118 goes here in my cache too.
Okay? And I save the tag. It's separate.
But if you want to read, and I have an M copy, I check my state, see M, and then I will release it so that you can read.
That's what coherence is about.
So, there's a state for every block in the entire memory, and it's in exclusive or modified? Yes.
So, okay, we only talk about the per-processor MSI protocol, right?
The block you have here is I, or S, or M.
In memory, you have a global status, right? Actually, in memory you won't literally have any state bits here, but conceptually: if some cache has the block in M, and an S read miss comes, that cache must write back first. You need to wait for the write back to happen, and then memory supplies.
Okay? If it is S, memory sees the read request and can supply directly, because the copies are clean.
So, the modified and shared states are stored per cache. We don't literally have two bits in every memory location, but every memory location corresponds to a particular state in some cache… Yes, yes. You can also think of something similar to I: no cache has it. So, when you have a read request for 110, let's say, and you check and no processor has it in M or S, then memory can just supply. Only when some cache has M do you wait until the write back happens, then you supply.
Okay? If the corresponding state is S or 'none', memory supplies immediately.
Okay.
So let's do a simple one first, and then I can make it more complicated. But I really want you to understand this part, okay?
We spent so much time on how to operate a cache, right? This is still caches. Only now, on top of one cache, we have multiple caches.
So, that's what the text is saying.
Okay?
So, let's go to the first question, okay?
With this current state, P15 tries to read 118.
Look at your state diagram and tell me what happens.
The first question:
P15 reads 118.
So, for 118, we already did the decomposition. The index is 11, so you go to the last row, right?
And then you do tag matching, and it's matched.
But it's I.
Okay, then what do you do?
I don't have the state diagram with me, although I memorized it. What happens? Look at your state diagram and tell me what happens. Look at your CPU side.
You should put a read miss on the bus.
Okay?
When you put the read miss, the others listen, right? What happens to them?
Look at your state diagram. What happens? You see the bus read miss; from I, you do nothing. How about S?
It's a tag match, right? The requester tries to read, but you don't do anything, because memory will supply the value, right?
So, for memory address 118, this line will change to… Let me change the color.
Where is color?
Okay.
The answer: this will change to S, tag 118, and value 0018, okay? They only show the first word. The problem with this exercise is that they never go to the second word.
Okay, I want to make a second-word example. How big is a word here?
I assume one word, depending on the context, right? So here we took 3 bits for the offset, and 4 bytes is usually one word, right?
The table shows something like two bytes of data.
Anyway, the question treats those two displayed digits as representing the whole four bytes of the word.
So it shows everything.
Then when you make this change, it will give 0018. That's what I struggled with, because it asks for 118, right? So you need to supply one word, and this counts as one word in the textbook.
Okay.
So it seems the other words are hidden; they never touch the other words.
Okay, so let's go to the second one.
The thing is, when we go to the second part, I will start over with the initial state, okay? So I will delete everything; the final state was like that, okay?
So you will start the second question with a clean initial state, okay?
So, the second one. Let me see.
The second question: P15 writes 100 with value 48.
So: P15, write, 100, value 48.
Okay.
For 100, again, you should decompose: 0001 0000 0000. The offset is 000, and these two index bits are 00, okay? So you go here.
And is it tag-matched?
Is it tag-matched or not? No. Okay, so what do you do?
Right?
It's a miss, right? So you need to put a write miss on the bus.
Okay, this write miss is heard by everyone, and they check.
One of them has 100, but it's I, right?
Look at your finite state machine. What are you supposed to do when you're in I? Nothing.
Nothing, isn't it? In I, you don't have a copy. So the others can read, write, do whatever they want, right?
Okay? So then the 100 block is brought here, okay?
So, first, the state will be changed to M.
And then the tag is 100, and the data will be supplied from memory; you write down all the actions, okay?
However, you are writing 48, so the word changes to 48.
Okay?
This is the second question.
So let's go to the third one.
Let me… is it okay to erase?
Yeah, it's exclusive.
You heard everything; it means it's frozen, right?
I don't get 'exclusive'. It means modified, which is an exclusive state. Yeah.
So, MSI: in an earlier slide, the textbook says exclusive, but when we generalize the protocol, it is MSI.
A write back happens? With a write miss, do you need to do a write back?
Very good question. You have the 120 block here, right? Address 120. It was S.
Shared means clean, yeah. Okay, so you don't have to. Only from M do you need to write back.
Okay. When you're bringing that in from memory, does it start as M? Yes: because you're bringing it in for a write, it will be M from the beginning.
Okay. All right, so it seems like we need to have a class on Friday, okay? We will have class on Friday, and then we'll continue today's material, okay?
See you on Friday!
Maybe on Friday we won't start early. Okay, I will put it in the announcement. Our original class time starts at 3, right?
Yeah, if you're ready for it.
Dec 5:
There weren't any points that… Oh, okay, okay. Yeah, yeah, I will correct that. I will give you credit, and then I will give you more time.
Can you make it after Monday? I have operating systems… Yeah, it'll be… because the deadline for Homework 5 should be before the final exam. Oh, yeah. So you should do it while you are preparing for the final.
The answer key they have comes with solutions. I think the answer was, like, 5, right?
I want you to do it. Yes. Okay, I want you to do it, because this question will be on your final.
Okay.
Yep.
I should evaluate it a little bit. But,
And then check. We have a downloaded one, right? Where can I find the download? Here?
Okay, well, actually, no.
So it's just another one. Why do I need to sign in to open it?
Like, from Homework 5, where at the top it said textbook exercises. No, it didn't have the textbook ones. It did. It was, like, 5. You obviously didn't do this, right? Yeah.
How can I…?
Oh, then it stays. That's good. Yeah. See? If you drag the file to the browser.
Okay, good. As long as I can see it, it's fine.
Alright, so this is the status. Okay, let me repeat.
What if the address… let's see, here.
I can improvise the question. For the current status, what if P15 tries to write 114 with value 60, okay?
What will happen?
So, P15 looks at its own cache, right? So you need to decompose this into binary fields: 0001 0001 0100.
And we discussed there are four blocks, and two words per block; each word, we assume, is four bytes, okay, so these three bits will be the offset.
And these two are the index, which means you go through 00, 01, 10, 11, okay? So you go to 10, okay? And then you do tag matching. The tag is the rest, right?
And it should match with this one, once you convert it to binary. A match is a hit. However, the state bit is I, which means you don't have it. So what happens?
Look at your transitions, the finite state machine of the MSI protocol, from I. What happens?
It's your own write, and it's a miss, right? So you will put a write miss on the bus, bring the block from memory, and change the state to M, right? The action for this arrow, your own write, is a bus write miss.
You put the write miss on the bus, because you need to get the block. Then what happens? First, this is the first thing.
And it will be seen by the others. Look at that.
There are two, right?
They do tag matching: although the address is 114, the rest of the tag matches 110's entry.
Is it clear? That's why you need to do this the binary way, okay? The way we did for the cache, so that you know 110 actually shares the same cache block. It's the same tag.
110, let me see… 110 goes this way.
These bits are the offset, and the rest is the tag. Look at the tag.
It's the same tag, right? Tag matched. So this is a hit. Okay, from M, you see the write miss on the bus. What happens?
Invalidate. You need to invalidate…
Invalidate, okay? So this will move back to invalid. But when you see the bus write miss from M, you also need to do a write back.
Right?
So this will change to I, okay, simply.
Then in memory, you go to 110; it will change to 0030. Okay, that's that.
So the second operation: you see the write miss on the bus, you snoop, and then you write back.
Then this updated block will be supplied to the requester.
Alright?
So then this line will change to M, right?
And the tag: I'll just put 110, but the real tag field is only 0001000, okay? To be precise.
Then you got 0030, but you are writing 60 to 114. So how will it change?
You get 0030: the first 4 bytes are the first word, and the later 4 bytes are the second word. The write goes to the second word, so that word changes to 60, giving 60-30, okay?
You got it?
You got it?
So if you see this kind of cache in the final, the address is not appear here. For
example, 1…
24. 124. Okay, 124, where it goes.
124, actually, same block with the starting address 120, because you have 8 bytes.
Right?
Then, if you are writing, for example,
P5, 15, you are riding 124, you are riding, let's say, 80, okay? And what happened?
You check your own cash.
What do you see? S. What does it mean?
It's a hit, but you need to put right miss on the bus, right? So you put right miss
on the bus, and then others snoop? Do you… do they have a 120? No, right? Can you
see that? 120 is not there.
So, it won't do anything because you don't have it. You change 2M,
Because S means you're having same copy as memory. This is up-to-date copy. So you
can just change to M, okay? And then you can put
80 here.
This is upper second word will change to 80, so you are… the request for this will
change to this, okay?
All right, clear?
This is the most complicated version I can think of. Your reference is not always to the first word; it can go to the second word, okay?
Because when you always touch the first word and you save the whole address as the tag, it magically works, right? But that's not the way a real snoopy protocol works. You have a tag field only, and then real tag matching happens.
Thank you.
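The transitions used in all the parts above can be summarized in a small sketch. This is my own condensation of the simplified three-state snooping protocol from the lecture, not the textbook's exact figure, and the function names are mine.

```python
# Sketch of the simplified MSI snooping protocol used in these
# examples. States: 'M' (modified), 'S' (shared), 'I' (invalid).

def cpu_request(state, op):
    """Local CPU read/write: returns (new_state, bus_action)."""
    if state == 'I':
        return ('S', 'read miss') if op == 'read' else ('M', 'write miss')
    if state == 'S':
        # A write from S still places a write miss to invalidate sharers.
        return ('S', None) if op == 'read' else ('M', 'write miss')
    return ('M', None)  # in M, both read and write hit silently

def snoop(state, bus_event):
    """Another core's miss seen on the bus: returns (new_state, action)."""
    if state == 'M':
        # Dirty copy must be written back; downgrade to S on a read
        # miss, or to I on a write miss.
        new = 'S' if bus_event == 'read miss' else 'I'
        return (new, 'write back')
    if state == 'S' and bus_event == 'write miss':
        return ('I', None)   # invalidate the clean copy
    return (state, None)     # I, or S seeing a read miss: do nothing
```

For part (a), for instance, P15 reading 118 from I gives `cpu_request('I', 'read')`, i.e. put a read miss on the bus and end in S; the cache holding the block in M answers a read miss with `snoop('M', 'read miss')`, i.e. write back and drop to S.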
So let me clean up…
Do I need to… so we start with the second one, right? We did the second one. Correct?
And then let's do the third one. Compared to the others, this is the easier, simpler version. It's non-zero. Okay? So, (c) says P14 writes to 118 the value 80, okay? So, you tell me what's going on. If you don't get it now, you won't get it, right? So tell me if you are not clear.
Where do you go? 118. You can see it here; it's easy, right? But just in case, do the conversion.
So this is index 11, you go here, and yes, tag matched, okay? Tag matched, but it's I. Okay, from I, you want to write. We already did this, right? From I, what happens?
You put… it's your own write.
But it's a miss, right? A write miss on the bus, okay? So you put the write miss on the bus. What happens?
Don't miss this case.
You are writing 80 to 118.
Who hears it? Both of the others are listening, right? All others. But where do you see a change?
Shared? Shared.
It should do what?
Invalidate, right? That cache has a clean copy, but this one will update the block, so the copy is no longer valid, so it just changes to I. You need to list these out as actions, so P1…
P1's block in the third row will change to I, okay? You need to write that down.
And then this write miss goes out, and memory provides 0018 for address 118. Then you update the first word, 18, to 80. This is the last step, okay? How about memory? Memory still has 0018. Okay, you don't update memory, because it's write-back.
Okay, so you will write: first, P1 changes this block to invalid, the memory supplies 0018, and then 18 changes to 80, okay? That's all you need to write down.
Should I write it in words? No, yeah, you can just write down the steps. Okay?
Because it asks what's the final state of your cache, okay?
All right, if you're clear, let's go to the next one. Any questions?
All right. So, any volunteer for the last one? The last one is the most complicated, maybe. P15 writes 108.
The words are 4 bytes? Yes, and 8 bytes per block, two words.
Normally, little- versus big-endian is within a word. What's the order? What happens? Tell me every step.
Here.
So, 108 is 0001 0000 1000; you know where to go? Index 01, right? You need to check the second row. You go here.
It's S. What happens?
It's a hit, but can you update?
Go ahead. What do you need to do?
Look at your finite state machine.
When you are in S, what's the first action you need to do?
Yes, from S, of course you will move to M eventually, but for your own write,
you should put the write miss on the bus, okay? That's very important. Why?
There are cases where other processors also hold it in S, right? So you put the write miss, and then what happens?
P0 sees 108; they snoop, right? They snoop, and match against their tags.
Only this one is matched, so it's invalidated, right? So you need to write down: P0 changes this field to I, okay? The rest you can copy as is. All we care about is changing this to I. We don't do purging.
We don't clean up the data. That's another
security point. A lot of times, when we invalidate, we leave the value there, okay?
So an attacker can use that information. We only change the status bit to I, okay?
All right, then what happens? It's a write miss, so you get 108, right?
Do you need anything from memory? No. So actually, as soon as you put the write miss on the bus, you go ahead and change this to, what, M?
Okay? And then the tag is 108, and this value will be 00AD, okay?
Okay, and that's it.
How about if you…
So the offsets go 8, 9, A, B, C. So if you request this, I create this question: P15 writes 10C, okay, 10C, and the value… 88, let's say 88.
Let me clean out everything. Okay, this is the status. What happens? Tell me.
I just came up with a new question for P15. You have only up to question (d), right? From your quiz. This is (e); I make it now. (e), really? Okay.
I didn't break anything.
What's (e)?
P15 reads 110. Then what happens?
110.
So, it's I. What do you need to put? A read miss on the bus, right? So the others hear the read, and then…
Is there any copy?
Okay, so let me hold the other question; let's do this one first.
P15 reads 110.
Okay, so I'm trying to read 110, it's invalid, so you put the read miss, right?
And then for 110… oh, there is an M copy. What happens? So what should that change to? There.
Yeah. Yes. Think one more time.
You have an M copy.
And someone else tries to read. So which state do you move to,
S or I?
S, okay? Because we allow multiple readers to
exist, right? So this just changes to S.
But you need to write back, right? Write back 110. So this 0030 will be written back to memory. Memory will supply 0030, okay?
So this 110 here will be changed to S,
with value 0030, okay? So this is the final answer.
Look at that.
There are two S copies, and they have the same value, okay? You can have that.
Is it clear?
Okay, okay, let's go to (f), the question I create. I will give you two minutes; do it, and you can discuss with your friend.
So P15 writes to 10C the value 88, okay? I will delete the other parts we don't need. Index 01, so it's this row of the cache. This question starts on a clean cache like this.
So the 01 row has the shared version of the data.
You think you know, you think you understand, but check that you understand every step.
Without any attempt, that's fine; you won't get any deduction. Right? Oh, no, no, this is for…
That was what I was imagining.
Points for the last one. So I would…
Attempt it. I will give my own answer after, so that you can compare with mine. That's the only way you can check whether you really understand or not.
If you wait until I give you a solution, and then you nod, 'oh, yeah, yeah, I understood', right? Then you won't get it right. Do it by yourself.
So we have a shared copy, and we're doing a write. So the shared value, and that one does the write.
Oh, that's just how you…
No, but it's also skipping, right?
That's the fourth point. You're saying that one's addresses are not shown.
Delete those, because we're not showing them anymore.
I'm assuming it's just the 0.3 coming, right?
Okay, so where do you see the change?
So, P15. So what's the final answer?
So, P15, 10C. Where is 10C? C means…
1100. Okay, very good. Because, yeah, it's 8 plus 4, right? Okay, good.
So 10C is 0001 0000 1100.
So your index is 01. So you go here.
Right? Is it a hit or a miss?
Excellent.
Miss? Well, for this question, and the final especially:
don't rely on the numbers you see in the table. You need to convert to binary, and you need to keep the tag only, so that you do real tag matching.
So, for 108, you see the upper parts are the same.
Okay, that's the tricky part I'm going to put in your final exam, okay?
This is what we have done for the last four months, right? Offset, index, and tag. This is how a real cache works; even distributed, it's the same thing, okay?
So it's a hit, and it's S. What happens? It's a write miss; you put the write miss on the bus, okay?
So, what happens to the other processors? P0? P0 has an S, right?
So what happens when you see the write miss?
You know someone else is trying to write, and will update the block, so this will be…
Invalid, okay. Just I, that's it.
How about memory? Does memory need to supply any data when it sees the write miss?
No. Because it's a write miss from S, as soon as you put the write miss on the bus, you update your state to M.
Then where do you put 88? The upper word, the second word. This is the first word, this is the second word, so 8808 will be the final contents of this block. Clear?
Clear? Wait, why do you put 88 in that position? So, this is how I interpret it. In a block, you have
8 bytes.
Okay, 8 bytes, but we display a short number, so the first 4 bytes are bytes 0 to 3, and bytes 4 to 7 are the later part. The original was 0008. When you write to 10C, it means you are updating the
second word; that's its logical position. Do you understand?
Okay, so 00 will change to 88. Okay, the textbook doesn't have this part; they always change the first word.
And I want to make sure you understand how this works. We don't always touch the first word, right? And we don't store the address as the tag as-is. It's a real tag field; it's different from the address, okay?
Question? Could we write 16,000 instead of 88, and then that 88 would be 16,000, and it works because 4 bytes are being written there? Let's follow the notation and just put 88.
Right, I'm saying: if the question instead said write 16,000, would that be a valid question? And if it was written, would that 16,000 go where the 88 just went?
Yeah. Okay.
So all of these numbers, we're assuming, are zero-extended. Do you understand his question? In all the examples it happened to be two digits, but what about 1600? Yes, you can have 1600, okay?
All right.
So, then, let's talk about directory-based coherence, okay? Because I know for your interviews, you need to know directory-based.
Because…
The first time, when I jumped into CMP research, like 20 years ago, we came up with some idea, we tested it, and we showed the results, and of course we used the default
cache coherence protocol, which was MSI, okay? And then the reviews asked why you use MSI: you artificially create more traffic.
Whereas most real systems don't use MSI.
Okay? So,
I will share my final decision on whether I will test this or not, but in this class, you want to understand what MESI is, at least, and MOESI.
For Intel, actually, it is MESIF. They have an F, okay. So, if I were you, heading to AMD, Intel, any
general-purpose computer design, you really want to
study the cache coherence protocols, directory-based, okay? Because, yeah, here we have bus-based.
But anything at the server level has a general network and a directory-based protocol, and they don't use MSI, okay? The reason I put MSI on the final is to make sure you at least understand this.
And then when you have an interview, in two days you will study, right? If you understand this, you can understand the other protocols, because they just add more states; think about why and how they reduce traffic.
Okay?
Alright.
So let's go.
Let's go.
Where was it?
Okay, so let me quickly cover the performance, or should we read directory-based first?
Okay, performance.
Because we are computer scientists,
we all care about performance.
Amen.
Audio shared by Kim, Eun J
With this set of slides, we're going to discuss the performance of a cache coherence protocol.
Before we discuss the performance of a snoopy cache coherence protocol, let's talk about the complications of the protocol we just learned.
Note that the cache coherence protocol we discussed in the prior slide set is a simple three-state protocol. Oftentimes it is referred to as the MSI protocol: M stands for modified, S stands for shared, and I stands for invalid.
The complication of this basic MSI protocol is that the operations we assume here are not atomic. For example, detecting a miss, acquiring the bus,
and receiving a response takes a long time; it cannot be implemented as one atomic operation, so it can create deadlocks or race conditions. One simple solution you can assume for now is that the processor
that sends the invalidate holds the bus, so that other processors cannot do anything; otherwise you can't guarantee that the invalidation behaves atomically. You
hold the bus until all the other processors have received the invalidate.
This MSI protocol has many extensions, okay? But here we will only talk about two
basic extensions: one, MESI, and the other, MOESI.
Okay? So MESI adds the state exclusive to the basic MSI protocol, yielding four
states: modified, exclusive, shared, and invalid.
The exclusive…
I would Google MESI, and I would try to draw MESI by myself. You can draw MSI, right?
Just,
if it is too complicated, you can think about your own actions only, okay? MSI is simple, right? From I, if you write, you go to M; if you read, you go to S,
right?
And then from M: if you hear someone else read, you go to S, and if someone else tries to write, you go to I. You can think through it, right?
So, MESI: the slides will explain it, but you can think of it this way. There is a special S state.
If you are the first sharer, you get E.
Exclusive.
Think about what's the benefit of the E state.
S means there are many sharers.
E means you are the only one who has a clean copy.
So…
In an interview, you may never have thought about this kind of thing, but then you need to rationalize it, right? So in terms of data operations, there are only two kinds.
Right? You will have a read or a write, right?
So, if you have E and you keep reading, it's the same as S, right?
How about if you're going to write?
How is it different from S? You don't need to invalidate, right? Because you know you're the only one. You can just
write, and change to M. Can you see that? Okay? That's the thing: you can reduce traffic.
Okay?
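The E-state benefit just described can be sketched as two small rules. This is an illustration of the idea only, with my own function names, not the full MESI transition table.

```python
# Sketch of the two MESI rules discussed above.

def mesi_fill(others_have_copy):
    """State given to a block filled on a read miss."""
    return 'S' if others_have_copy else 'E'  # sole sharer gets E

def mesi_cpu_write(state):
    """Local write: returns (new_state, bus_action)."""
    if state == 'E':
        return ('M', None)          # only clean copy: silent upgrade
    if state in ('S', 'I'):
        return ('M', 'write miss')  # must invalidate possible sharers
    return ('M', None)              # already M
```

The traffic saving is the `('M', None)` case: a write from E generates no bus transaction, whereas the same write from S in plain MSI would.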
This E state indicates that a cache block is resident in only a single cache,
and it is clean.
So, if a block is in the E state, it can be written without generating any
invalidation, so you can reduce the number of bus transactions.
When a read miss to a block in the E state occurs, the block must be changed to the S state
to maintain coherence. Because all subsequent accesses are snooped, it is possible
to maintain the accuracy of this state.
The advantage of adding this state is that a subsequent write to a block in the
exclusive state by the same core doesn't require bus access or generate an
invalidate. Since the block is known to be exclusive to this local cache, the
processor merely changes the state to modified.
This state is easily added by using a bit that encodes the coherence state as
exclusive and using the dirty bit to indicate that the block is modified. The Intel i7
uses a variant of the MESI protocol called MESIF, which adds a state, Forward, to
designate which sharing processor should respond to a request. Remember, in the
MSI protocol we just learned, we assume memory will respond, right? So here, Intel
gets a more optimized result by having a local core respond to that request.
It is designed to enhance the performance of a distributed system.
Another variation, very popular…
But Intel, okay, because you guys… Intel still is the biggest company on the CPU
design. So, Intel has M-E-S-I-F.
F. Okay, so in your… think about your MSI protocol. What happens whenever you have…
you put the miss, the memory supply data, right?
Even your S. S means I have a clean copy, but you are quiet, you check, or you are
going to read, you check, you don't forward the data to the requester. Can you see
that?
Right? But… but Intel?
has F state. If you have F, and then you… you listen, readmiss, and then you will
supply the data, okay? Especially if you are E, you are the only one, right? You
supply. Can you see that? From E, if you see other reader request comes, you
supply, okay?
Exclusive will supply. The other four changes is state to F. What does it signified
to the other codes? What is the purpose of this F? You can supply the data with…
Yes, yes. So F, the state with F is in charge of forwarding.
You have a designated forwarder among all the caches in S. Right.
So, the purpose itself is just to forward, right? Yeah. The status
of the line is still shared. Yeah, it's still shared. Then what is the purpose of
needing a new
state? You can always assign a forwarder. You can get rid of a
memory transaction, because it becomes only a bus transaction,
instead of memory supplying the data. S means memory has the same copy, right? In the
earlier protocol,
the other cores with S stay quiet,
isn't it? But here, you forward, right? And then it will be faster.
Okay, what I was saying is that that is just a sharing processor, right? So only the
memory or the bus needs to know that the other core is sharing the data, so memory
need not supply the data.
Like, what is F, right? Then why do you need the F state for the memory? Who reads
that F and decides based on it? I need
to double-check, but when there are many sharers, you
need to have one designated sharer as the
forwarder, right? Not all sharers will forward.
So even from E, you see a second read request, right? Then you forward your
data.
Right? And then you change to F, okay? Yeah.
Alright.
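The F-state walkthrough above can be sketched as a toy model. This follows the lecture's simplified description, where the cache in E (or F) supplies the data on a remote read miss and becomes (or stays) the forwarder; in Intel's actual MESIF the newest requester typically receives F. All names here are illustrative, not a real API.

```python
# Toy model of the MESIF read-miss path described in lecture:
# plain S holders stay quiet; an E or F holder forwards the data
# instead of memory, and becomes/remains the designated forwarder.

M, E, S, I, F = "M", "E", "S", "I", "F"

class Cache:
    def __init__(self, name):
        self.name, self.state = name, I

def read_miss(requester, caches):
    """Service a read miss; return who supplied the data."""
    supplier = "memory"                  # default: memory responds
    for c in caches:
        if c is requester:
            continue
        if c.state in (E, F):            # designated forwarder supplies
            supplier = c.name
            c.state = F                  # stays/becomes the forwarder
        elif c.state == M:               # dirty owner supplies and downgrades
            supplier = c.name
            c.state = S
        # plain S holders stay quiet: they do NOT forward
    requester.state = S
    return supplier
```

So after the first read miss the E holder moves to F and keeps forwarding on later misses, saving the memory transaction each time.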
So this is one thing. Maybe if I decide to put MESI, or MOESI, or MESIF on the exam,
it will be an extra point, okay?
If I add those questions.
The next variation is M-O-E-S-I, which adds a state, Owned, to the MESI protocol to
indicate that the associated block is owned by that cache and out of date in memory.
So in the MSI and MESI protocols, when there is an attempt to share a block in the
modified state, the state is changed to shared,
and the block must be written back to memory. But in the MOESI protocol, the
block can be changed from Modified to the Owned state in the original cache without
writing it to memory. So it reduces the bandwidth to memory.
AMD uses this kind of protocol.
So, in M-O-E-S-I, you can think of O as a special form of M, okay? In M-E-S-I, E is a
special form of S, right? The first sharer goes to E instead of S.
But in M-O-E-S-I, O is a special form of M. So when you are in M and you see a read
request on the bus, what are we supposed to do? We change to S. Instead of
changing to S, we change to O, okay? And I forward the value to the other sharer.
Can you see that?
Okay? It's similar to M-E-S-I-F; you need to figure it out, it's your homework,
okay? In case I put it in the final. But in MOESI, instead of going to S, you
will go to O. So O means
there are other readers.
Okay, I'm also supposed to only read, but when a replacement happens, what do I
need to do? How is it different from S?
S means the copy is the same as memory, isn't it?
A clean copy. You only read, and memory has the up-to-date copy. O means I'm reading
it, but I have a
dirty copy, okay? The up-to-date one. So when you replace it,
it should be written back to memory. When you have S and a
replacement happens, what do you do? You just overwrite. You don't need to write
back, because memory has the up-to-date copy. Can you see that? That's how
replacement happens.
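The O-vs-S distinction above comes down to what happens on a remote read and on eviction, which can be sketched as follows (a minimal, illustrative model, not a full protocol):

```python
# Sketch of the MOESI point above: a Modified owner that observes a
# remote read moves to Owned (keeping the dirty data, no memory
# write-back yet); the write-back is deferred until the Owned line
# is evicted. Shared lines are clean and can be dropped silently.

def remote_read(state):
    """State change of a cache that observes a remote read miss."""
    if state == "M":
        return "O"    # MOESI: supply a copy, skip the memory write-back
    if state == "E":
        return "S"
    return state      # S stays S; O stays O and keeps forwarding duty

def evict(state, memory_writes):
    """On replacement, M and O are dirty and must write back; S/E are clean."""
    if state in ("M", "O"):
        memory_writes.append(state)   # deferred write-back happens here
    return "I"
```

This is exactly the bandwidth saving described: the write-back that MESI does at the moment of sharing happens in MOESI only once, at eviction.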
Similar to exclusive?
With Exclusive you have the same copy, so you don't need to write back.
Exclusive is a special form of S. S means a clean copy, where you are supposed to
only read; you have the same copy as memory.
O is from M. M means modified, so it can be different from memory, right?
Okay?
Now we know the MSI protocol, a snooping-based cache coherence protocol, based on a
symmetric shared-memory multiprocessor system, like the one shown in this figure.
There are limitations in symmetric shared-memory multiprocessors, and so in
snooping-based protocols.
Because as the number of processors grows, or the memory demand of each processor
grows, any centralized resource in the system can become a bottleneck.
In multi-cores,
a single shared bus becomes a bottleneck with only a few cores. As a result,
multi-core designs have moved to higher-bandwidth interconnection systems.
We'll discuss it later.
Another thing we can think about is how to increase snoop bandwidth, okay?
Bandwidth at the cache can become a real problem, because every cache
must examine every miss, and adding interconnection bandwidth only
pushes the problem to the cache, right? So instead of a bus, maybe we can have a
high-bandwidth interconnection, but then your cache becomes the bottleneck.
One fix is to keep duplicate tags, one copy for CPU cache accesses, the other for
snooping. This doubles the effective cache-level snoop bandwidth.
The other solution we discussed before is multi-level caches: if the outermost
cache on the multi-core is shared, we can distribute that cache so that each
processor has a portion of it and handles snoops for that portion of the address
space.
This…
Okay, a tricky question for interviews, one I use when I interview PhD students:
where does the cache coherence protocol reside?
Between the private and shared caches. So, you can think…
The cache coherence problem basically comes from replication, isn't it?
It arises when we allow multiple copies. So the first place we allow multiple
copies, that is where you need the coherence protocol. So look at this figure.
It says one or more levels. Let's say levels 1 to 3 are
private caches.
Okay? Private caches. And then your last-level cache, level 4, is shared.
It's shared, and it is banked, right? So do we have any duplication, any
replication, in this last-level cache?
Shared means you have an address, and, let's say there are four banks, the
middle 2 bits will indicate which bank to go to.
Can one cache block be in two different places? No, right? Each block maps to one
bank.
The Intel Xeon server also works like that: it is banked, you have only
one place to go, which means there is no replication. You don't need a
cache coherence protocol for this level, because you don't have any multiple copies.
Cache coherence means deciding what to do when you have multiple copies, right?
Can you see that?
Only between these third-level caches and the fourth-level cache do you allow
duplication for the first time.
That is where you need the protocol. Can you see that?
So whenever you have a read or write miss at the third-level cache, you go through
this protocol.
And between levels 1, 2, and 3 you already have duplicates, but they are the same
copy, so you don't need a cache coherence protocol there. So, let's say…
If I change this banked cache into banked memory, I get rid of the fourth-level
cache, okay? Then you have level 2 private, and level 3 shared, or everything
up to level 3 private, okay? You don't have a fourth-level cache.
Then
the cache coherence protocol resides between the last-level cache and memory,
because in memory you don't allow multiple copies, so you don't need to worry about
it, right? And the last-level cache is the third level, where you see the copies
for the first time.
Can you see that?
It's a very tricky question. I found a lot of students don't have this picture of
why you need the cache coherence protocol. Cache coherence is needed because we
allow multiple copies. When you have a shared cache, you don't have multiple copies.
Look at this: with a bank, you have only one place to go, so we don't have extra
copies. But private means you can have multiple copies, and then you need cache
coherence, okay?
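The no-replication argument above rests on bank selection: a few middle address bits pick exactly one bank, so a given block can live in only one place inside a banked shared cache. A minimal sketch, with made-up field widths (64-byte blocks, 4 banks):

```python
# Bank selection in a banked shared cache: the bits just above the
# block offset choose the bank, so every byte of a block -- and every
# access to that block -- lands in exactly one bank. No replication
# inside this level means no coherence protocol is needed there.

BLOCK_OFFSET_BITS = 6    # 64-byte blocks (assumed)
BANK_BITS = 2            # 4 banks (assumed)

def bank_of(address):
    """Middle bits just above the block offset select the bank."""
    return (address >> BLOCK_OFFSET_BITS) & ((1 << BANK_BITS) - 1)
```

Coherence is then needed only at the boundary where replication begins: between the private caches and this banked shared level.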
This approach is used by IBM's 12-core Power8. These are multi-core designs with a
distributed cache, and they can effectively scale the snoop bandwidth at
level 3 by the number of processors.
The…
The last solution we can think of is to place a directory at the level of the
outermost, lowest shared cache.
For example, if you have a level 3 cache, the last-level cache, level 3, can act as
a filter on snoop requests; however, it must be inclusive of the
higher-level caches.
The use of a directory at level 3 means we need not snoop or broadcast to all
the L2 caches, but only to those that the directory indicates may have a copy of
the block.
Just as L3 may be distributed, the associated directory entries may also be
distributed.
We will talk more about directory-based protocols, okay?
I want to test this on the exam, because a lot of systems, AMD, Intel, when you
have 8 or 16 cores, have a more general interconnection, and the last-level
cache is banked, so…
I told you, cache coherence resides between these levels, right?
So, let's say you need a certain cache block, and between these levels there
is no bus.
So how do we get the information that used to be available on the bus?
From the directory, okay? You can think of this memory as divided into four;
each bank, say bank zero, has its own separate
directory. From the address, you know which bank might have this data, so you
go there.
And there is a directory. The directory will tell you whether the block is S or M
or I.
In terms of status, it's the same, whether it's a directory or a bus, okay? But the
operation is different.
This approach is used in the Intel Xeon server, which supports from 8 to 32 cores.
In a multi-core using a snooping cache coherence protocol, several different
phenomena combine to determine the performance.
In particular, the overall cache performance is a combination of the behavior of
uniprocessor cache miss traffic and the traffic caused by communication, which
results in invalidations and subsequent cache misses.
Changing the number of processors, cache size, and block size can affect these two
components of miss rate in different ways.
We talked about the different cache misses: capacity, compulsory, and the last one,
conflict, right? The 3C miss model.
We will add one more C here: coherence misses.
Coherence misses are caused by multi-core communication, okay? There are two
sources. One is true sharing misses,
which come from the communication of data through the cache coherence mechanism.
In an invalidation-based protocol, the first write by a processor to a shared cache
block causes an invalidation to establish ownership of that block. Additionally,
when another processor attempts to read a modified word in that cache block, a miss
occurs and the resident block is transferred.
Both these misses are classified as true sharing misses,
because they directly arise from the sharing of data among processors.
The second one is called false sharing. It arises from the use of an
invalidation-based cache coherence protocol with a single valid bit per cache block.
False sharing occurs when a block is invalidated because some word in the block,
other than the one being read, is written to. If the word written to is actually
used by the processor that received the invalidate, then the reference was a true
sharing reference, and
it would cause a miss independent of the block size. However, if the word being
written and the word read are different, the invalidation does not cause a new
value to be communicated, but only causes extra cache misses.
So, if we have the same code, can we change the architecture, cache size, block
size, set associativity, whatever? Can we reduce true sharing misses?
True sharing means what?
You share a variable.
Okay.
You share a variable.
Can we avoid that miss? You are writing, I'm writing, alternately, a miss every
time, right? The MSI protocol, isn't it?
So for true sharing, we can't do anything,
because it's a program characteristic.
How about false sharing?
I change variable A all the time. You change variable B, but A and B happen to be
in the same block.
Right?
So every time you change B, or I change A, although there is no true sharing,
you keep having misses. Can you see that?
That can be avoided, with help.
You have a smaller chance of false sharing with a smaller block, right? So,
that was the idea, okay? I will skip the rest of it; you can listen to the recording.
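The true-vs-false distinction above can be captured in a tiny classifier (illustrative only): a miss is true sharing when the word the reader touches is the word the writer changed, false sharing when they merely share a block, and it disappears entirely once the block is small enough that the two words no longer collide.

```python
# Classify an invalidation-induced miss between a writer and a reader,
# per the lecture's definition. Words are numbered from 0; a block
# holds `block_words` consecutive words.

def classify_miss(written_word, read_word, block_words):
    """Both word indices are absolute; blocks are aligned groups."""
    if written_word // block_words != read_word // block_words:
        return "no miss"          # different blocks: no invalidation at all
    if written_word == read_word:
        return "true sharing"     # a value is really communicated
    return "false sharing"        # invalidated only by block granularity
```

With A at word 0 and B at word 1 in an 8-word block, every alternation is a false sharing miss; with 1-word blocks the miss vanishes, while a true sharing miss survives any block size.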
And then let me start the directory protocol; we have a Monday class too, right?
I can finish it all then.
Okay.
And quiz 32, I want to release it, okay? I want to ask you to do it.
In terms of state transitions, it's exactly the same, but you really need to
understand that instead of putting the request on the bus…
Why?
Why are you leaving?
We have a Monday class. Monday, right?
Online, right? Online.
Monday is a redefined day, so I'm supposed to teach, actually.
Oh, you don't want to have a class? I won't be happy.
Okay, we have a class.
With this set of slides, we will discuss directory-based coherence protocol for
distributed shared memory systems.
So, here, imagine this. There is a centralized directory. It keeps track of who is
sharing, who has what, like bookkeeping, okay? So whenever you have a read miss,
where do you go? You go to the directory. Can you see that?
Okay? You issue the read miss request, and it will be forwarded to the
directory. The directory finds who is M, okay? From M, you will get the block,
and the state can change to S.
If it is S, what do you do? You can get it from memory, or if there is an F, one of
the sharers will forward the value. Can you see that? So there is one
centralized place to go, and the order in which requests arrive there serializes
the actions, okay?
It acts like the bus, okay?
DSM.
As we discussed, a snooping-based protocol requires communication with all caches
on every cache miss, including writes of potentially shared data.
The absence of any centralized data structure that tracks the state of the caches
is the fundamental advantage of a snooping-based scheme, but also its weakness.
However, in terms of scalability, when you have more than 8 processors, this is a
real bottleneck.
It limits scalability.
The alternative to a snooping-based cache coherence protocol is a directory-based
protocol.
A directory keeps the state of every block that may be cached.
Information in the directory includes which caches have copies of the block,
whether it is clean or dirty, and so on.
Within a multi-core with a shared outermost cache, it is easy to implement a
directory scheme: simply keep a bit vector of size equal to the number of cores
for each L3 block.
The bit vector indicates which private L2 caches may have copies of a block in L3,
and invalidations are sent only to those caches. This works perfectly for a single
multi-core if L3 is inclusive, and this scheme is the one used in the Intel i7.
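The per-block bit vector just described can be sketched in a few lines (an illustrative model; the core count and method names are assumptions, not a real directory implementation):

```python
# One directory entry per L3 block: bit i set means core i's private
# L2 may hold a copy. On a write, invalidations go only to the set
# bits instead of being broadcast to every core.

NUM_CORES = 8  # assumed core count

class DirectoryEntry:
    def __init__(self):
        self.sharers = 0                    # bit vector of possible sharers

    def add_sharer(self, core):
        self.sharers |= 1 << core

    def invalidate_targets(self, writer):
        """Cores that must receive an invalidate when `writer` writes."""
        targets = [c for c in range(NUM_CORES)
                   if self.sharers >> c & 1 and c != writer]
        self.sharers = 1 << writer          # writer becomes sole holder
        return targets
```

With 8 cores this costs only one byte per L3 block, which is why inclusion of L2 in L3 makes the scheme so cheap.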
Okay, so here. Before, when someone put a read or write request on the bus,
everybody listened, and the invalidations happened right then, right? Each
processor acted in a very proactive way. Now we don't have that system.
The requester will send: oh, I need to read, I need to write. And the request will
arrive at the directory. The directory sees what's going on: oh, there are 3
sharers. Now, you don't have to broadcast.
The bus was broadcast-based, right? Here, we have point-to-point communication. I
send individual invalidation requests to the sharers. Okay, someone
tried to write; oh, you have a copy, you have a copy, you have a copy, I know, so I
send the individual requests.
And then, if we want to provide atomicity, I wait until all the
invalidation acknowledgements come back.
I know 3 requests went out, and 3 invalidations are done, then, okay, then I will
grant: okay, EJ, you can…
now you can change it. Can you see that? That's the kind of strong
sequential consistency we can provide by having every acknowledgement come back
before granting the updated value. But if we relax it, what do we do?
In the meantime, you let it go, okay? Then some race conditions can happen.
Okay. Did you take the networking class? The internet?
Ethernet is also a bus, right? So what are they doing in terms of
arbitration?
It's the same as a bus, right? Our system bus. You're taking that
class, right? So, when I try to use the bus and it is busy, then what happens?
A collision. What was that called? CSMA/CD, yeah. So, what happens? A
collision occurs,
and then you back off; then the next time, you try to use the bus again.
Do you wait a random time, or do you always double? It's random doubling. Yeah,
the window is randomly doubled, right? So that you avoid the collision happening
again.
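The "random doubling" mentioned here is Ethernet's binary exponential backoff, which can be sketched in a couple of lines (slot times and the 2^10 cap follow classic Ethernet; the function name is illustrative):

```python
import random

# After the n-th collision a station waits a random number of slot
# times drawn from [0, 2^n - 1]; the window doubles with each collision
# (capped at 2^10), so the same stations are unlikely to collide again.

def backoff_slots(collisions, rng=random.randrange):
    """Random wait in slot times; window doubles per collision, capped."""
    window = 2 ** min(collisions, 10)
    return rng(window)
```

The analogy to the coherence discussion is arbitration: both a snoopy bus and Ethernet serialize access to a single shared medium, which is exactly what stops scaling past a handful of nodes.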
I used to write proposals about this, because my research area is
general interconnects, not bus-based ones, right? So I always needed to
persuade people, because most of our systems used a bus, right? But why do we need
a mesh, or a torus, a bigger network? Because the bus has a severe
scalability problem, and it takes a long time to grab the bus.
So it's the performance bottleneck, actually. When we have more than 8 or 16
cores, we use a more generic network, and then we lose the broadcast medium.
So we need to have a centralized directory.
The solution of a single directory used in a multi-core is not scalable, even though
it avoids broadcast.
The directory must be distributed, but the distribution must be done in a way that
the cache coherence protocol knows where to find the directory information for any
cached block of memory.
The easy solution is to distribute the directory along with the memory, so that
different cache coherence requests can go to different directories,
just as different memory accesses go to different memory banks.
If the information is maintained at an outer cache, like L3, which is multi-banked,
the directory information can be distributed across the different cache banks,
effectively increasing the bandwidth.
So, okay, let me stop here. Look at this figure.
Let's say you see there are 8 cores, and then you see the
memory; let's treat it as memory, or it can be a last-level
cache that is multi-banked, with 8 banks, okay?
Let's say your processor is here; you have a read miss, you check your
own private cache, you don't have the block, okay? Then you need to find the
directory. The directory is also distributed; that's what I want to explain. You
see there are 8
memories, and 8 different
directories. It works in a multi-banked way. You use a certain
portion of the bits in your address, which tell you where the directory is. Can you
see that?
So, if the bank number is 000, this goes to directory 0, and 001 to directory 1,
okay?
From the address, you have a designated directory you should search.
Okay? Then this directory will supply the information for that designated banked
memory, okay?
That's how Intel does it, okay?
I will stop here, and I will see you on Monday. I gave you two final exam questions
today, okay?
Dec 8:
Sure, but if you want to just optimize things, like, some random benchmark.
This is about where I thought it was.
Yeah.
Very nice, about 100 pounds. And then 100 pounds will be fixed.
Peace. Yeah. Yeah.
With this set of slides.
Yeah, we were supposed to fix the bug, so…
Good afternoon! So, this is the last class. Woo! I'm so happy.
Okay, there was a change: I was going to have a special office hour on Friday,
but I moved it to Thursday. I found I have a conflict on Friday
evening. So this Thursday, from 5 to 6, I will have an office hour over Zoom, okay?
Prepare some questions, and I will go over them. If you don't have a question, I
will adjourn the Zoom
office hour early, okay? I won't wait the whole hour. So if you have questions,
arrive there at 5 p.m., okay? Then we will go over your questions.
And then I have a day to finalize the final exam, so it would be
good to have the office hour one day before, because if I have an office hour Friday
evening, I don't have time to reflect
your questions in the final contents, because Monday is your final exam, right? So
Monday, from what time to what time? 10:30 to 12:30, right?
Can you check? And it's here, right? 10:30 to 12:30. And I will try to be here by
10:15, and
we will start right on time.
And if you need any extra time because of a special accommodation, let me know;
email me one more time, just to remind me, okay?
We'll do that.
Anything else?
Okay, so, we have a final set of slides I need to finish.
So let's do that.
Okay, so to run through this…
I told you about the snoop-based protocol
and the write-back policy; you learned them, and that will be on the final, right?
But in real
machines there are a lot of directory-based protocols, because we don't have
small-scale multiprocessors, we have large-scale ones. And the way it works is the
same; you just need to send the request to the directory
instead of broadcasting on the bus, okay? So let's just recap what you know,
and then quickly move on.
we will discuss directory-based coherence protocol for Distributed Shared Memory
System, DSM.
As we discussed, a snooping-based protocol requires communication with all caches
on every cache miss, including writes of potentially shared data.
The absence of any centralized data structure that tracks the state of the caches
is the fundamental advantage of a snooping-based scheme, but also its weakness.
However, in terms of scalability, when you have more than 8 processors…
We discussed this, right?
Okay, I do remember discussing this, so we can start by refreshing these states.
They're the same as in the
MSI protocol you learned, okay?
For the directory protocol, there are two operations: first, handling a read miss,
and the other, handling a write to a clean cache block.
To implement these two operations, a directory must track the state of each block,
with three states.
First, Shared:
one or more nodes have the block cached, and the value in memory is up to date.
You need to keep the set of sharing nodes, so we call it the sharer list.
Uncached: no node has a copy of the cache block.
Modified means exactly one node has a copy of the cache block, and it has written
the block, so the memory copy is out of date. That processor is called the owner of
the block.
The directory maintains the state of each block and sends invalidation messages to
the sharers.
Before introducing the protocol state diagram, let's look at the catalog of message
types that may be sent between processors and directories for the purpose of
handling misses and maintaining coherence.
Here, the local node is the node where a request originates.
The home node is the node where the memory location and directory entry of the
address reside.
Physical address space is statically distributed, so the node that contains the
memory and directory for a given physical address is known.
So, for example, with multi-banked memory, you know a certain portion of the
address will be used as the bank index, so from that we know which memory bank we
should go to.
The local node may also be the home node. The directory must be accessed even when
the home node is the local node, because copies may exist in a third node, which is
called the remote node.
A remote node is a node that has a copy of the cache block, either exclusive or
shared.
A remote node may be the same as either the local node or home node.
In such a case, the basic protocol does not change, but inter-processor messages
may be replaced with intra-processor messages.
Let's look at this table.
Here, P is the requesting node number, A is the requested address, and D is the data.
The first three messages are requests sent by the local node to the home.
The fourth through sixth messages are sent to a remote node by the home when the
home needs data
to satisfy a read or write miss request.
Data value replies are used to send a value from the home node back to the
requesting node.
Data value write-backs occur for two reasons: first, when a block is replaced in a
cache and must be written back to its home memory, and also
in reply to fetch or fetch/invalidate messages from the home. Writing back the data
value
whenever the block becomes shared simplifies the number of states in the protocol,
because any dirty block must be exclusive, and any shared block is always available
in the home memory.
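The message catalog just described can be summarized as a small table (a sketch of the kinds of messages and their payloads, following the lecture's P/A/D notation; the exact names are illustrative, and a real table carries more detail):

```python
# Directory-protocol message catalog: (source, destination, contents),
# where P = requesting node, A = address, D = data. The first group
# flows local -> home, the second home -> remote, plus data movement.

MESSAGES = {
    # local node -> home directory
    "read_miss":        ("local",  "home",   ("P", "A")),
    "write_miss":       ("local",  "home",   ("P", "A")),
    "invalidate_req":   ("local",  "home",   ("P", "A")),
    # home directory -> remote owner/sharers
    "invalidate":       ("home",   "remote", ("A",)),
    "fetch":            ("home",   "remote", ("A",)),
    "fetch_invalidate": ("home",   "remote", ("A",)),
    # data movement
    "data_value_reply": ("home",   "local",  ("D",)),
    "data_write_back":  ("remote", "home",   ("A", "D")),
}
```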
The basic states of a cache block in a directory-based protocol are exactly like
those in a snooping protocol.
So you can easily follow.
We can start with a simple state diagram, like the MSI protocol we've seen before.
It shows the state transitions for an individual cache block, and then we will
examine the state diagram for the directory entry corresponding to each block in
memory. As you can see, the left figure shows the actions to which an individual
cache responds, and the right one shows what happens in the
directory in response to the messages received.
Let's first look at the left one, which shows the actions of an individual cache in
response.
Here, the notation is…
So you can open up the drawing you have, right? I asked you to draw it; that's
important. You can compare: these are exactly the same.
So when you try to read, you check your cache index and tag, and it is
invalid. What do you do?
The actions are different. Instead of putting a
read miss on the bus, now you send that request to the directory, where the
memory is. The memory will give you the data, right? Then, of
course, no matter what happened, you will change to S, isn't it? You didn't have
the block; now you have it for reading.
How about a write? Same thing. You send the request; no matter what (we will talk
about the full protocol), from Invalid you will go to M, isn't it?
Okay? From M, you are writing; then what do you need to do? You can write, you
stay, right? But should you let
memory know? No; it's a hit, you just do it. A read hit, you just do it.
From here, the directory may send you: oh, another reader has come. Then you should
change from here to here,
right? You need to write back.
It's exactly the same thing. The only difference is, instead of the bus, we have a
centralized directory.
Okay? The directory will initiate things. So, for the directory, for example:
from Uncached, when you have a read request, it will change to Shared. And
it will stay Shared if the other requests are all reads.
But whenever you stay there, what do you do? You add to the sharer list. You update
the sharer list.
So here:
on the bus, when you put the read miss, others can listen, so they update or
invalidate, right? Here, we don't have any way to know. So the directory could
broadcast, but broadcast is very inefficient, because if you have 64 cores,
maybe the number of sharers is only 4; you don't need to broadcast, you don't need
to send "invalidate this block" to everyone. You just send the individual
requests to the set of sharers, so we keep track of a sharer list.
Okay? In the interconnection field, we don't call this broadcast. This is
multicast: you have multiple destinations, and based on the destination list,
you can come up with an efficient routing algorithm. That's one of my old pieces of
work, yeah.
Okay, so as long as you keep getting read misses, you stay there, okay?
How about when a write
request comes? Then you need to invalidate. You need to send invalidations to all
sharers, and the state changes to M. Then make sure you
record who the owner is, okay?
That's everything that happens.
It's not difficult at all, right?
Any question?
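The directory-side transitions just walked through fit in a short sketch (a minimal, illustrative model of the Uncached/Shared/Modified entry with a sharer list and point-to-point messages; method and message names are assumptions):

```python
# Directory entry for one block: Uncached -> Shared on a read (add the
# requester to the sharer list); on a write, send individual invalidates
# to the sharers, record the new owner, and go to Modified.

class Directory:
    def __init__(self):
        self.state, self.sharers, self.owner = "U", set(), None

    def read_miss(self, node):
        msgs = []
        if self.state == "M":                     # dirty owner must write back
            msgs.append(("fetch", self.owner))
            self.sharers.add(self.owner)
            self.owner = None
        self.state = "S"
        self.sharers.add(node)                    # update the sharer list
        msgs.append(("data_value_reply", node))
        return msgs

    def write_miss(self, node):
        msgs = [("invalidate", s) for s in sorted(self.sharers) if s != node]
        if self.state == "M":
            msgs.append(("fetch_invalidate", self.owner))
        self.state, self.sharers, self.owner = "M", set(), node
        msgs.append(("data_value_reply", node))
        return msgs
```

Note the multicast point from the lecture: invalidates go only to the recorded sharers, never to all 64 cores.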
So with this scenario, you can think of why, last class, we talked about M-O-E-S-I,
right? And M-E-S-I. Here
it is the MSI protocol; MESI has one more state, E. E is for the first time you
move from Invalid as a sharer: the first sharer goes to E instead of S. And
from the second or third request on, you go to Shared, okay?
Why?
From E, if you are writing, you don't need to talk to the directory; you can just
go to M,
right? So it will be faster, with less
traffic. One time, our paper was rejected because we used the default
cache coherence protocol provided by the gem5 simulator, which I believe was
MESI or MSI. And the reviewers were like, why did you use this one? The
most common thing is MOESI or MESIF.
It was a very simple neglect.
I learned the hard way.
Right? Because these advanced cache coherence protocols
reduce the amount of traffic, and they were kind of suspicious that we artificially
increased the traffic to show our benefit, right? Those are the things
people can really scrutinize in your idea, okay?
So these are all the explanations. I don't think we need to go through more detail.
We will spend more time on the review, okay, if you don't have any questions.
Okay? Because this is very repetitive with the snoopy protocol, okay? You always go
to the directory, and the directory maintains the owner; if the block is M, you
know who the owner is, right? So you tell the M
owner to write back if there is a read request.
If there is a write request, what do you need to do? Same thing. You request a
write-back, then I provide the modified block to the new owner, and then I update
the directory.
So think about the directory's role. Now, what if a read request comes while the
block is uncached?
It's the first time, right? So you record the move from U to S,
and then put the requester's name there as a sharer.
One thing that's different:
let's say you ask to read, and when I look up my directory,
Anna has the block, but in M.
Okay? If we follow the strict MSI protocol we learned on the bus, what do we
need to do? I have her invalidate, which means she writes back to me, okay? And
then I relay the data to you to read, right?
But here is what Intel does.
You can think of the two of you as very close, sitting by each other. So you can
just quickly look at her copy,
whereas I'm over here, so she would need to walk to me, then I copy, then I go to
you and you copy. It takes time.
However, if you are together on the same chip, she is much nearer, so she can
forward the data more quickly than the memory can supply it. So that's the kind of
thing we do: the
directory can tell the owner to forward the block, okay?
And at the same time, she also needs to send her block back to me as a write-back.
Why?
Because it becomes a sharer. The shared state means all the
cache copies out there
should be the same as the memory contents.
Okay.
So those are the updated versions, you can imagine, okay?
Okay!
Do you have any questions on the directory-based protocol? This is the most popular
question you're going to get,
I'm sure,
if you interview with AMD, Intel, Microsoft, Facebook, whatever, on the hardware,
CPU design side.
Maybe also if you interview for
machine learning accelerators, or systems for AI. It's a huge field, right? I need
to create a new course on systems for AI.
Right, so we have all these different topics.
Even beyond accelerator design for AI, in CPUs we have
multiprocessors. In multiprocessors, we work together, and
communication happens through this cache coherence protocol. It's very important.
Okay?
Alright, so if you don't have a question, let's… Wonderful.
Yes, exactly the same. The only thing: for example, you want to write, and
when you check your copy, it's S, let's say.
What does S mean? It's the same implication: there may be other copies, right?
So in the bus-based case, what do you do? You want to write: you have a
store, your PC fetches the store, then you calculate the address.
With the address, you check the cache; your copy is S, but your instruction is a
store, which means an update. Then what do you need to do?
I'll show you one. Yeah.
Or I have to… In that street plan.
Nope.
You need to put a write miss on the bus, isn't it? Because you have shared data to read, but you don't have permission to write, right? So you put the write miss on the bus.
So that action, you can translate it to: talk to the directory, right?
Exactly the same thing.
Where you used to put the write miss on the bus, now you send the write miss to the directory. Replace the bus with the directory, and everything works the same way.
In the directory-based design, the medium for on-chip communication is usually not a bus anymore. Maybe it's a mesh — point-to-point.
Then the message has to travel to the memory controller, where the directory is. It's a multi-hop trip; it arrives, and there it is: a write miss, right? I receive a write miss from node 001.
Okay? Then I look at the directory. There is only one case, because his status was S already, right? So it should be S in my directory too.
What do I need to do?
I look at my directory: this block is S.
I should grant his request, right?
So that he can write. What do I need to make sure of — invalidation to all sharers? Yes, you need to check the sharer list.
If he's the only sharer, I don't need to do anything else, right? If there are two other people sharing this block, then what do I need to do?
I need to literally send an invalidate message to each sharer, okay? That's the difference.
On the bus, the write miss is delivered toward the memory over a shared bus, so everybody listens. Here there is no way for them to know — these two people never imagined he's going to write, right? So the directory handles it: oh, you have a copy, but he's going to update, so I will ask you to invalidate.
Okay? To be fully correct:
I will wait until the acknowledgements come. I send invalidate messages to each of the two sharers, and they reply to me: yes, I invalidated.
Only when I have got both invalidation acknowledgements,
do I send him the acknowledgement: okay, you can go ahead and update. I change the directory state to M, and then he can change to M. So we have a kind of global synchronization point, okay? And it will degrade the performance.
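The invalidate-and-acknowledge flow just described can be sketched as a tiny directory model in C. This is a hypothetical sketch, not any real CPU's implementation: the struct layout, state encoding, and function name are all made up for illustration. The key idea it shows is that on a write miss the directory counts the invalidates it must send, and only after that many acks arrive would it grant write permission.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical directory entry. States: 0 = Uncached, 1 = Shared,
   2 = Modified. `sharers` holds one bit per node with a copy. */
typedef struct {
    int state;
    uint32_t sharers;
    int owner;          /* meaningful only in the Modified state */
} DirEntry;

/* Handle a write miss from `requester`. Returns how many invalidate
   messages the directory must send (and therefore how many acks it
   must collect) before it can reply "go ahead and update". */
int handle_write_miss(DirEntry *e, int requester) {
    int invalidates = 0;
    if (e->state == 1) {                       /* Shared: invalidate the others */
        for (int n = 0; n < 32; n++)
            if (((e->sharers >> n) & 1) && n != requester)
                invalidates++;                 /* send invalidate to node n */
    } else if (e->state == 2 && e->owner != requester) {
        invalidates = 1;                       /* fetch/invalidate the old owner */
    }
    /* After all acks arrive: the requester becomes the sole owner in M. */
    e->state = 2;
    e->sharers = 1u << requester;
    e->owner = requester;
    return invalidates;
}
```

For a block in S with three sharers (nodes 0, 1, 2), a write miss from node 0 requires two invalidates — exactly the two-acknowledgement wait described above.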
The alternative: he sends the write miss, and I let him update and go right away, right?
Meanwhile, the others are not invalidated yet.
So even though he has written the new value, there is a chance these two still read stale data, because the invalidate messages are multi-hop travel, right? So that's the optimization-versus-correctness trade-off.
A lot of times, as hardware designers, we break correctness a little bit so that we can boost performance, because in most cases it won't cause a problem — only in very odd cases. And we believe that for those special cases, the software — you, the user — will put in explicit synchronization: I need to wait until this has been done, then I can move on.
Like a semaphore or a barrier.
That is the software tool you should use.
Okay.
Okay, so if we go back to the earlier module…
Usually I would put a Tomasulo or hardware speculation question in the final again, but I promised I won't, right? I won't.
Let me think. Because we did SIMD, right? So you learned SISD — instruction-level parallelism only, one instruction stream, one data stream.
But then we learned SIMD: the exact same code, the DAXPY, right? The vector version — we changed to a vector processor.
And then we also learned the multiprocessor, right?
So I will try to come up with something that checks basic understanding of the differences between these.
Because your main topics are MIMD and SIMD, right? You need a solid understanding of SISD as well — hardware speculation, out-of-order execution.
So let me think, and then Thursday, maybe, or on Piazza, once I finalize my thoughts, I will let you know. It won't be a surprising question. Just when you describe something, maybe you can relate it to hardware speculation or Tomasulo. I think we briefly discussed this when I introduced SIMD, but let me come up with a concrete example of how I can ask, okay?
Other than that, the material from before the midterm won't be in the final, so good for you.
So, alright. So, memory hierarchy.
Basically, you need to go over all the quiz questions we did, and then all the homework questions, including homework 4. The thing that failed you on the midterm was neglecting homework 2, right? I told you homework 2 would be there.
The notations I told you we would use. This time, again, RRIP — how the access sequence affects the replacement policy — you need to study that, okay?
Okay.
So that's what I plan to put.
And the other thing… let me see. Do we have a note? How do I get an empty page?
Do you know how to open the whiteboard pad?
Oh, okay, I can do it here… there is a window, right?
Under more options, you might find a whiteboard. Whiteboard, okay. Where is the whiteboard?
We used to have a whiteboard, right?
Yeah, you need to stop sharing and share again. Yeah, yeah, thank you.
So… then… sure. Share? Well, it just shows who shared. Okay.
So, the question — anyone here who took 312 with me? No one?
Okay, so this is the question I commonly put.
So you could ask 312 students what their question was, right? Commonly put, but yours will be a little more refined. So…
Do you remember the quizzes on the cache? I always give an address like 0x441A, then a sequence of addresses, and then ask: what if you use a victim cache, or one of the other techniques we learned, right? A multibank cache?
We did. Every quiz I created, you need to go through, right?
And then, let's say, RRIP…
Okay? The answer was the final state of your cache — that's what I like. But I found that this way, you never connect a real program with the cache. So this time I will give a program.
So, for example, you have a for loop: i = 0; i < N; i++. Then you have an inner loop over j, the same way, okay? Then, let's say, the body is
sum += A[i][j].
Then I ask you to give the final contents of the cache. I configure the cache. So from here —
can you come up with the access sequence? As long as you translate it to a sequence of memory addresses, it's exactly the same question, isn't it?
So it's no new question. And this part — I have like 5 or 6 quiz questions on it, right? You are familiar with it. So you can translate the program into memory addresses.
And then, with each memory address, you know you need to decompose it into three fields based on the cache configuration, and then you use those to figure out where the block goes, right?
So these things you need to study.
So, let's say… let's look at this part.
Can you translate this to assembly code? What happens first?
Because we are talking about memory, right? Data. So there are only two operations, either read or write.
Does a read happen first, or a write?
Read. So what do you read in the first iteration?
Read A[0][0].
Then, what do you do?
When we talk about memory, we don't care about the computation, right?
That read maybe goes into R1, then what? You do R1 plus R2, assuming R2 holds the sum, right? Then what is the next line?
You store — that means a write.
The sum, right? Isn't it?
Where's the store coming from? Isn't the sum only stored at the end of all the iterations? Isn't sum just a register until the end of everything?
No. Your compiler — think about the example we went through. You load A[0][0], then you add, and then
you store. You always have 3 lines of code.
We had that example in the classroom all the time. This is the original code in C.
When we were doing stores, it was because we had an array we were assigning to. In this example, we only have a local variable, which doesn't need to be stored at any point here.
Like, if I were a compiler, I would say this sum doesn't need to touch memory, so I wouldn't generate a store. I'd just read the values.
Mmm…
Yeah, we can modify the question so that sum is a vector element rather than an individual scalar. Yeah, I will change it then.
Well, if you do that, it still stays in a register until you finish all of them. No, we won't do it that way.
But, Christine, if you're going to repeatedly access and do a summation of…
Okay, so let me do that. There is a loop index, whatever — I can make it K, let's say.
Alright.
So, to avoid the optimization argument, I can just do a transpose: B[j][i] = A[i][j].
Okay.
That one. So let's do this.
So what happens?
You don't read B, right? You read A — your compiler reads A[0][0],
and then you write to B[0][0]. What is next?
Next iteration: read A[0][1],
and you write B[1][0], right?
Depending on what N is. If N equals 3, it goes until A[0][2], B[2][0].
Then you are done with the j loop, and you increment i, right? So what is the next one?
Read A[1][0].
Then you read, write, read, write…
B[0][1], yeah — the indices switch, right?
Can you come up with the sequence? How many read/write operations do you have?
If N equals 3, how many iterations do you do? 3 times 3, right? 9. And each iteration has two operations, one read and one write. So, 18. And then you can translate this to addresses, right? But a sequence of 18 is too much — maybe I will use N equal to 2 only.
Right?
When we do a write, will it say write-through or write-allocate?
Yes, it will be given. I try hard to make the write behave the same as a read, so you don't have to worry too much.
Like, in this example, if we had write-through no-allocate, we wouldn't care about the writes in terms of the cache. If it was write-allocate, we would have to first bring in the block for each accessed location of B. So with write-back and allocation, a write would behave the same as a read.
You're right.
Okay?
So as long as I use the allocate policy, a write is the same as a read. When you have a read miss, you allocate; when you have a write miss, you allocate. Same thing. If I use no-allocate, it's totally different: you don't bring the block being written into the cache, so the final contents differ.
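That allocate versus no-allocate distinction can be sketched with a toy direct-mapped cache model. Everything here is a made-up illustration — the 4-set, 16-byte-block geometry and the function names are not from the quiz — but the policy difference is the real one: with write-allocate, a write miss fills the cache like a read miss; with no-write-allocate, the cache is left untouched.

```c
#include <assert.h>
#include <stdint.h>

#define SETS  4     /* example geometry only */
#define BLOCK 16

typedef struct {
    int valid[SETS];
    uint32_t tag[SETS];
} Cache;

/* Returns 1 on hit, 0 on miss. `allocate` selects the write-miss policy. */
int cache_access(Cache *c, uint32_t addr, int is_write, int allocate) {
    uint32_t idx = (addr / BLOCK) % SETS;
    uint32_t tag = (addr / BLOCK) / SETS;
    if (c->valid[idx] && c->tag[idx] == tag)
        return 1;                       /* hit */
    if (!is_write || allocate) {        /* read miss, or write-allocate miss */
        c->valid[idx] = 1;              /* bring the block into the cache */
        c->tag[idx] = tag;
    }
    return 0;                           /* miss (no-allocate writes skip the fill) */
}
```

Under write-allocate, a write miss to an address makes the next read of it a hit; under no-write-allocate, the later read still misses — which is exactly why the final cache contents differ between the two policies.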
I could give you a nightmare with that.
Will the memory layout be given?
Yeah, definitely, definitely, yeah.
So the quiz you had with the multibank cache and so on — it's exactly the same thing. The cache configuration will be given.
I don't have to change that part. Only, instead of a sequence of memory addresses, I can give you code.
And with this code, I will give the first address of A[0][0] and the first address of B[0][0].
Is it stored like Fortran, or like C?
It will be given, okay?
Okay.
But it's always row-major, isn't it? Row-major. And then you need to know — I will tell you it's an integer array, and whether an integer is 4 bytes or 8 bytes will be given.
Then, if you know the address of the first element, you know where the second element is.
As long as you can figure out the addresses, it's the same question.
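The address calculation itself is just the row-major formula, sketched here in C with a configurable element size (4 or 8 bytes, as will be given in the question). The base address used in the example is an arbitrary placeholder.

```c
#include <assert.h>
#include <stdint.h>

/* Row-major byte address of element [i][j] in an n-column 2-D array of
   `elem`-byte elements starting at `base`:
       addr = base + (i * n + j) * elem
   This is the step that turns the loop into a sequence of addresses. */
uint32_t row_major_addr(uint32_t base, int i, int j, int n, int elem) {
    return base + (uint32_t)(i * n + j) * (uint32_t)elem;
}
```

So for a 3x3 array of 4-byte integers at a base of 0x1000, A[0][1] is at 0x1004 and A[1][0] is at 0x100C — consecutive in j, a full row apart in i.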
If you want to stop the compiler from optimizing it away, you can declare it as volatile.
Yeah, but that would cause too much trouble, so I like this transpose question. There is no argument, right? You need to write to memory.
Okay, so let's go back to the module and continue where we stopped.
Great, thank you.
Yeah.
And as I explained before, I can only check a couple of techniques through this example, right? I cannot check all the things we discussed. For example, we have the 6 basic optimizations and the advanced optimization techniques, and for each one you need to be able to identify the name and how it works: does it help reduce miss penalty, miss rate, or hit time, right? I hate description questions, but there are so many topics we covered after the midterm.
And I cannot come up with a worked example for everything, right? So there will be some short description questions, of course.
And I will deduct points if you write something wrong.
So don't blah-blah if you don't know. Some students have gotten negative points, okay? Don't try it. Just read the question, and if you don't know, leave it blank, okay?
It won't be the end of the world if you miss a couple of them, okay?
There are some things I do check, okay? Your reading assignment — when we went through this topic, I told you I would ask you to summarize it, do you remember? So you need to go through the key concepts and summarize them in your own words, okay?
Don't write too much — if the space is this big, 2 or 3 sentences is enough, okay? I just look for the key idea.
Alright.
So those are all set, right? For SIMD, I gave you the question already.
Right?
The convoy and chime question will be there,
because I think it's important to understand how SIMD works through that exercise. So you want to review it — even before an interview, you'd want to review it.
And then the things we discussed about the optimizations for vector architecture — there are many, okay, many. I cannot check everything. Let me quickly see if I can come up with some calculation problems to check.
Yeah, maybe one or two. Even there, even if I use some quantitative way to check your understanding, it will be mostly drawn from the quizzes, okay? I would go through the quizzes a couple of times, okay?
Otherwise, most of it is things like: why do you need the vector-length register, and how is it used? What is the vector mask register? And then you can connect that with branches, right — the branch-history table, things like that.
So this is what you need to do.
I haven't finalized the final exam yet, but mainly it will be like that. For GPU and loop-level parallelism, we did some quizzes, and there is terminology you want to know, okay?
So those are the main things. And then we talked about the snooping protocol. I told you question 31 will be there, but remember we have two-word blocks, right? The lower digits of the stored value indicate the first 4 bytes, and the upper part the second 4 bytes, okay? So if your address doesn't appear in the figure, you need to be able to navigate the cache structure.
That's what I can think of, okay?
So, that's it.
We can finish the class early. Do we have any questions?
Perfect.
Question 32 won't be in the final, so don't do it.
That is one of the exercise questions from the textbook. But later on, I didn't like it much, so I didn't publish it.
Just focus on your final preparation. I could release it and let you read more, but let's not waste your effort.
Okay.
Anything you want to go over? Yeah, the extension of Homework 5 —
what do you want us to do with it?
The deadline for the cache coherence part was before I covered the cache coherence protocol, so I want you to do it one more time.
Okay — which problems?
You didn't do the textbook problems, right? For now, you should do all the textbook problems I suggested, and submit them, okay?
Alright.
Is it clear?
I want you to practice more.
Maybe I will update the announcement for Homework 5. The deadline is just before the final exam.
While you prepare for finals, do the questions I suggested you study, and then submit them.
Okay?
Anything else?
So, you're mostly first-year MS students?
And PhD students, right? First year.
Who is second year? Anyone graduating? Someone is graduating, right?
Okay, good.
Maybe I shouldn't ask personal questions.
No, I was going to ask about career things, but…
Anyway, good luck with your careers and the rest of your lives, and if you work on anything related to the computer architecture area, let me know, okay? We can build a network.
And, yeah, let me know if anything you learned from this class was helpful for your interviews. I get a lot of emails from undergrad students — they are very sweet, they always thank me: oh, EJ, the things you told me…
That's all.
Anyway, it's very cool.
It's time to end. Thank you. I will come a little early for the final, so if you can come early, please do, so that we can start your final exam sharply and you have enough time. Thank you for joining this class.
Yay.
Remember to submit the proof for the project!