
Understanding Machine Code and Assembly

The document explains the fundamentals of machine code and assembly language, highlighting the differences between them, including the role of opcodes and operands. It covers the function of assemblers, the importance of comments, various types of operands, and the structure of assembly programs, including data segments and directives. Additionally, it addresses endianness and provides examples of data declaration and manipulation in assembly code.

Uploaded by

gizemgultoprak04
Copyright: © All Rights Reserved

Machine code

Each type of CPU understands its own machine language
Instructions are numbers that are stored in bytes in memory
Each instruction has its unique numeric code, called the opcode
Instructions of x86 processors vary in size
Some may be 1 byte, some may be 2 bytes, etc.
Many instructions include operands as well
Layout of an encoded instruction: opcode | operands
Example:
On x86 there is an instruction to add the content of EAX to
the content of EBX and to store the result back into EAX
This instruction is encoded (in hex) as: 03C3
Clearly, this is not easy to read/remember
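To make the encoding 03C3 less opaque, here is a minimal Python sketch that splits the second byte into its bit fields. It assumes the standard x86 register numbering (eax = 0, ecx = 1, edx = 2, ebx = 3, ...) and handles only this one register-to-register form of add; the helper name decode_add_r32 is made up for illustration.

```python
# Hypothetical helper: decodes only the two-byte register form of
# ADD with opcode 0x03 ("ADD r32, r/m32"). Nothing else is handled.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_add_r32(code: bytes) -> str:
    opcode, modrm = code[0], code[1]
    if opcode != 0x03:
        raise ValueError("only opcode 0x03 is handled in this sketch")
    mod = modrm >> 6             # top 2 bits: 0b11 means both operands are registers
    reg = (modrm >> 3) & 0b111   # middle 3 bits: destination register
    rm = modrm & 0b111           # low 3 bits: source register
    if mod != 0b11:
        raise ValueError("memory operands are not handled in this sketch")
    return f"add {REG32[reg]}, {REG32[rm]}"

if __name__ == "__main__":
    print(decode_add_r32(bytes.fromhex("03C3")))
```

The first byte is the opcode and the second byte carries the two operands, which is the opcode/operands split described above.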
Assembly code
An assembly language program is stored as
text
Each assembly instruction corresponds to
exactly one machine instruction
Not true of high-level programming languages
E.g.: a function call in C corresponds to many,
many machine instructions
The instruction on the previous slide (EAX = EAX + EBX) is written simply as:
add eax, ebx
Here add is the mnemonic, and eax, ebx are the operands
Assembler
An assembler translates assembly code into
machine code
Assembly code is NOT portable across
architectures
Different ISAs, different assembly language
In this course we use the Netwide Assembler (NASM) to write 32-bit assembly
You can install it on your own machine
Note that different assemblers for the same
processor may use slightly different syntaxes
for the assembly code
The processor designers specify machine code,
which must be adhered to 100%, but not assembly
code syntax
Comments
Before we learn any assembly, it’s
important to know how to insert
comments into a source file
Uncommented assembly is a really, really,
really bad idea
Comments are important in any language,
but for a language as low-level as
assembly they are completely necessary
With NASM comments are added after
a ‘;’
Example:
add eax, ebx ; this is a comment
Operands
Since assembly instructions can have operands, it’s
important to know what kind of operands are possible
Register: specifies one of the registers
add eax, ebx
eax = eax + ebx
Memory: specifies an address in memory.
add eax, [ebx]
eax = eax + content of memory at address ebx
Immediate: specifies a fixed value (i.e., a number)
add eax, 2
eax = eax + 2
Implied: not actually encoded in the instruction
inc eax
eax = eax + 1 (the operand 1 is implied by the inc instruction itself)
The move instruction
This instruction moves data from one location to
another
mov dest, src
Note that destination goes first, and the source goes
second
At most one of the operands can be a memory operand
mov eax, [ebx] ; OK: one memory operand
mov [eax], ebx ; OK: one memory operand
mov [eax], [ebx] ; NOT allowed: two memory operands
Both operands must be exactly the same size
For instance, AX cannot be stored into BL
Such “exceptions to the common case” make programming languages difficult to learn, and assembly may be the worst offender here
Examples:
mov eax, 3
mov bx, ax
Additions, subtractions
Additions
add eax, 4 ; eax = eax + 4
add al, ah ; al = al + ah
Subtractions
sub bx, 10 ; bx = bx - 10
sub ebx, edi ; ebx = ebx - edi
Increment, Decrement
inc ecx ; ecx++ (a 4-byte operation)
dec dl ; dl-- (a 1-byte operation)
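The comments above give each operation a size (4 bytes for ecx, 1 byte for dl). One consequence, not spelled out on the slide, is that results wrap around at that size. The sketch below models only the wrap-around with a bit mask; the function names are made up for illustration, and the processor's flags are not modelled.

```python
def add_n(a: int, b: int, bits: int) -> int:
    """Add two values as an n-bit register would, discarding the carry."""
    return (a + b) & ((1 << bits) - 1)

def dec_n(a: int, bits: int) -> int:
    """Decrement a value as an n-bit register would."""
    return (a - 1) & ((1 << bits) - 1)

if __name__ == "__main__":
    print(hex(add_n(0xFF, 0x01, 8)))   # add al, ah with al = FFh, ah = 01h
    print(hex(dec_n(0x00, 8)))         # dec dl with dl = 00h
    print(hex(add_n(0xFFFFFFFF, 4, 32)))  # add eax, 4 with eax = FFFFFFFFh
```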
Assembly directives
Most assemblers provide “directives” to do things that are not part of the machine code per se
Defining immediate constants
Say your code always uses the number 100 for a specific
thing, say the “size” of an array
You can just put this in the NASM code:
%define SIZE 100
Later on in your code you can just do things like:
mov eax, SIZE
Including files
%include "some_file"
If you know the C preprocessor, these are the same ideas as
#define SIZE 100
#include "stdio.h"
Good idea to use %define whenever possible to avoid
“code duplication”
C Driver for Assembly code
Creating a whole program in assembly requires a lot of
work
e.g., set up all the segment registers correctly
You will rarely write something in assembly from
scratch, but rather only pieces of programs, with the
rest of the programs written in higher-level languages
like C
So, in this class we will “call” our assembly code from C
The main C function is called a driver

int asm_main(void); // implemented in the assembly file

int main() // C driver
{
    int ret_status;
    ret_status = asm_main();
    return ret_status;
}

Assembly code called by the driver:
...
add eax, ebx
mov ebx, [edi]
...
NASM Program Structure

data segment: initialized data
bss segment: uninitialized data
text segment: code
The data and bss segments hold statically allocated data, i.e., data that is allocated for the duration of program execution
The data and bss segments
Both segments contain data directives that declare pre-allocated zones of memory
There are two kinds of data directives
DX directives: initialized data (D = “defined”)
RESX directives: uninitialized data (RES = “reserved”)
The “X” above refers to the data size: B = byte (1 byte), W = word (2 bytes), D = double word (4 bytes), Q = quad word (8 bytes), T = ten bytes
The DX data directives
One declares a zone of initialized
memory using three elements:
Label: the name used in the program to
refer to that zone of memory
A pointer to the zone of memory, i.e., an
address
DX, where X is the appropriate letter for
the size of the data being declared
Initial value, with encoding information
default: decimal
b: binary
h: hexadecimal
o: octal
DX Examples
L1 db 0
1 byte, named L1, initialized to 0
L2 dw 1000
2-byte word, named L2, initialized to 1000
L3 db 110101b
1 byte, named L3, initialized to 110101 in binary
L4 db 012h
1 byte, named L4, initialized to 12 in hex (note the ‘0’)
L5 db 17o
1 byte, named L5, initialized to 17 in octal (1*8+7=15 in
decimal)
L6 dd 0FFFF1A92h (note the ‘0’)
4-byte double word, named L6, initialized to FFFF1A92 in
hex
L7 db "A"
1 byte, named L7, initialized to the ASCII code for “A” (65)
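The declarations above can be checked with a short Python sketch that parses the b/h/o suffixes and produces the bytes each directive reserves. The function names parse_nasm_number and define are made up for illustration, only db, dw and dd are covered, and multi-byte values are written least significant byte first, which is the x86 byte order discussed later in the deck.

```python
SIZES = {"db": 1, "dw": 2, "dd": 4}
BASES = {"b": 2, "h": 16, "o": 8}

def parse_nasm_number(text: str) -> int:
    """Parse a number with an optional b/h/o suffix (decimal by default)."""
    text = text.strip().lower()
    if text[-1] in BASES:
        return int(text[:-1], BASES[text[-1]])
    return int(text)

def define(directive: str, literal: str) -> bytes:
    """Return the bytes that `directive literal` would place in memory."""
    size = SIZES[directive]
    value = parse_nasm_number(literal) % (1 << (8 * size))
    return value.to_bytes(size, "little")

if __name__ == "__main__":
    for directive, literal in [("db", "0"), ("dw", "1000"), ("db", "110101b"),
                               ("db", "012h"), ("db", "17o"), ("dd", "0FFFF1A92h")]:
        print(directive, literal, "->", define(directive, literal).hex(" "))
    print('db "A" ->', "A".encode("ascii").hex())
```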
ASCII Code
Associates 1-byte numerical codes to
characters
Unicode, proposed much later, originally used 2 bytes per character and thus could encode 2^8 = 256 times more characters (room for all languages, Chinese, Japanese, accents, etc.)
A few values to know:
‘A’ is 65d, ‘B’ is 66d, etc.
‘a’ is 97d, ‘b’ is 98d, etc.
‘ ’ is 32d
ASCII Table
DX for multiple elements
L8 db 0, 1, 2, 3
Defines 4 bytes, initialized to 0, 1, 2 and 3
L8 is a pointer to the first byte
L9 db "w", "o", 'r', 'd', 0
Defines a null-terminated string, initialized to “word\0”
L9 is a pointer to the beginning of the string
L10 db "word", 0
Equivalent to the above, but more convenient
DX with the times qualifier
Say you want to declare 100 bytes all
initialized to 0
NASM provides a nice shortcut to do
this, the “times” qualifier
L11 times 100 db 0
Equivalent to L11 db 0,0,0,....,0 (100
times)
Data segment example
tmp dd -1
pixels db 0FFh, 0FEh, 0FDh, 0FCh
i dw 0
message db "H", "e", "l", "l", "o", 0
buffer times 8 db 0
max dd 255

Layout in memory (28 bytes in total), byte per byte, in hex:
Label    Size  Content
tmp      (4)   FF FF FF FF
pixels   (4)   FF FE FD FC
i        (2)   00 00
message  (6)   48 65 6C 6C 6F 00
buffer   (8)   00 00 00 00 00 00 00 00
max      (4)   00 00 00 FF
Endianness?
max dd 255
max: 00 00 00 FF
In the previous slide we showed the above 4-byte memory content for a double-word that contains 255 = 000000FFh
While this seems to make sense, it turns out that Intel processors do not do this!
Yes, the last 4 bytes shown in the previous slide are wrong
The scheme shown above (i.e., bytes in memory follow the “natural” order): Big Endian
Instead, Intel processors use Little Endian:
max: FF 00 00 00
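As a cross-check of the data segment example, the following sketch builds the same six declarations with Python's struct module, using "<" to request little-endian order. The layout helper is made up for illustration; it simply concatenates the pieces and records where each label starts.

```python
import struct

# (label, bytes) pairs mirroring the data segment example.
segment = [
    ("tmp",     struct.pack("<i", -1)),               # tmp dd -1
    ("pixels",  bytes([0xFF, 0xFE, 0xFD, 0xFC])),     # pixels db ...
    ("i",       struct.pack("<H", 0)),                # i dw 0
    ("message", b"Hello\x00"),                        # message db "H", ..., 0
    ("buffer",  bytes(8)),                            # buffer times 8 db 0
    ("max",     struct.pack("<I", 255)),              # max dd 255
]

def layout(entries):
    """Return ({label: (offset, size)}, concatenated bytes)."""
    offsets, image = {}, b""
    for label, data in entries:
        offsets[label] = (len(image), len(data))
        image += data
    return offsets, image

offsets, image = layout(segment)

if __name__ == "__main__":
    for label, (off, size) in offsets.items():
        print(f"{label:8} offset {off:2} size {size:2}  {image[off:off + size].hex(' ')}")
    print("total:", len(image), "bytes")
```

The last four bytes come out as FF 00 00 00, matching the little-endian correction above rather than the first drawing.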
Little Endian
mov eax, 0AABBCCDDh
mov [M1], eax
mov ebx, [M1]
Registers and memory after each instruction:
After mov eax, 0AABBCCDDh: eax = AA BB CC DD
After mov [M1], eax: memory at [M1] = DD CC BB AA (least significant byte first)
After mov ebx, [M1]: ebx = AA BB CC DD
Little/Big Endian
Motorola and IBM computers use Big Endian
Intel uses Little Endian (we are using Intel in this class)
When writing code in a high-level language one rarely
cares
Although in C one can definitely expose the Endianness of
the computer
And thus one can write C code that’s not portable
between an IBM and an Intel!!!
This only matters when writing multi-byte quantities to
memory and reading them differently (e.g., byte per
byte)
When writing assembly code one often does not care,
but we’ll see several examples when it matters, so it’s
important to know this inside out
Some processors are configurable (either in hardware or
in software) to use either type of endianness (e.g., MIPS
processor)
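The point about writing a multi-byte quantity and reading it back byte per byte can be tried out directly. This sketch uses Python's struct module, where "<" means little-endian, ">" means big-endian and "=" means the byte order of the machine running the script; the value is the one used in the earlier slides.

```python
import struct
import sys

value = 0xAABBCCDD

little = struct.pack("<I", value)   # what an Intel processor stores
big = struct.pack(">I", value)      # what a big-endian processor stores
native = struct.pack("=I", value)   # whatever the host machine does

# Reading the little-endian bytes back with the wrong assumption
# gives a different number: this is the portability problem.
misread = int.from_bytes(little, "big")

if __name__ == "__main__":
    print("little:", little.hex(" "))
    print("big:   ", big.hex(" "))
    print("host is", sys.byteorder, "endian:", native.hex(" "))
    print("little-endian bytes read as big-endian:", hex(misread))
```

Code that always names the byte order explicitly, as the first two lines do, behaves the same on every machine.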
In-class Exercise
pixels times 4 db 0FDh
x dd 00010111001101100001010111010011b
blurb db "a", "d", "b", "h", 0
buffer times 10 db 14o
min dw -19
What is the layout and the content of the data memory segment?
Byte per byte, in hex
In-class Exercise: Solution
Layout in memory (25 bytes in total), byte per byte, in hex:
Label    Size  Content
pixels   (4)   FD FD FD FD
x        (4)   D3 15 36 17
blurb    (5)   61 64 62 68 00
buffer   (10)  0C 0C 0C 0C 0C 0C 0C 0C 0C 0C
min      (2)   ED FF
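One way to check an answer to this exercise is to rebuild the segment in Python. This is only a sketch: the variable names follow the labels in the exercise (min is renamed minimum to avoid shadowing a Python built-in), and "<" again requests little-endian order.

```python
import struct

pixels = bytes([0xFD]) * 4                                       # pixels times 4 db 0FDh
x = struct.pack("<I", 0b00010111001101100001010111010011)        # x dd ...b
blurb = b"adbh\x00"                                              # blurb db "a","d","b","h",0
buffer = bytes([0o14]) * 10                                      # buffer times 10 db 14o
minimum = struct.pack("<h", -19)                                 # min dw -19

image = pixels + x + blurb + buffer + minimum

if __name__ == "__main__":
    print(len(image), "bytes")
    print(image.hex(" "))
```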
Uninitialized Data
The RESX directive is very similar to the
DX directive, but always specifies the
number of memory elements
L20 resw 100
100 uninitialized 2-byte words
L20 is a pointer to the first word
L21 resb 1
1 uninitialized byte named L21
Use of Labels
It is important to constantly be aware that when using
a label in a program, the label is a pointer, not a value
Therefore, a common use of the label in the code is as
a memory operand, in between square brackets ‘[‘ ‘]’
mov AL, [L1]
Move the data at address L1 into register AL
Question: how does the assembler know how many
bits to move?
Answer: it’s up to the programmer to do the right
thing, that is load into appropriately sized registers
Labels do not have a type!
So although it’s tempting to think of them as
variables, they are much more limited: just
pointers to a byte somewhere in memory
Moving to/from a register
Say we have the following data segment
L db 0F0h, 0F1h, 0F2h, 0F3h
Example: mov AL, [L]
AL: Lowest bits of AX, i.e., 1 byte
Therefore, value F0 is moved into AL
Example: mov [L], AX
Moves 2 bytes into L, overwriting the first two bytes
Example: mov [L], EAX
Moves 4 bytes into L, overwriting all four bytes
Example: mov AX, [L]
AX: 2 bytes
Therefore value F1F0 is moved into AX
Note that this is reversed because of Little Endian!!
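The loads and stores above can be modelled with a bytearray standing in for memory. In this sketch the helpers load and store are made up for illustration, address 0 plays the role of label L, and the value 1234h placed in AX is an arbitrary choice to make the store visible.

```python
def load(mem: bytearray, addr: int, size: int) -> int:
    """Read `size` bytes at addr as one little-endian number."""
    return int.from_bytes(mem[addr:addr + size], "little")

def store(mem: bytearray, addr: int, value: int, size: int) -> None:
    """Write value as `size` little-endian bytes at addr."""
    mem[addr:addr + size] = value.to_bytes(size, "little")

mem = bytearray([0xF0, 0xF1, 0xF2, 0xF3])   # L db 0F0h, 0F1h, 0F2h, 0F3h
L = 0                                        # hypothetical address of label L

al = load(mem, L, 1)        # mov AL, [L]
ax = load(mem, L, 2)        # mov AX, [L]
store(mem, L, 0x1234, 2)    # mov [L], AX with AX = 1234h (illustrative value)

if __name__ == "__main__":
    print(hex(al), hex(ax), mem.hex(" "))
```

The register size alone decides how many bytes move, which is why the same label works with AL, AX and EAX.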
More About Little Endian
Consider the following data segment
L1 db 0AAh, 0BBh, 0CCh, 0DDh
L2 dd 0AABBCCDDh
The instruction: mov eax, [L1]
puts DDCCBBAA into eax
Note that we’re loading 4x1 bytes as a 4-byte quantity
The instruction: mov eax, [L2]
puts AABBCCDD into eax!!!
When declaring a value in the data segment,
that value is declared as it would be appearing
in registers when loaded “whole”
It would be _really_ confusing to write numbers in little
endian mode in the program
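The difference between the two declarations can be reproduced in a few lines. This sketch assumes nothing beyond the slide: four separate bytes for L1, one little-endian double word for L2, and a 4-byte little-endian load for each mov.

```python
import struct

L1 = bytes([0xAA, 0xBB, 0xCC, 0xDD])     # L1 db 0AAh, 0BBh, 0CCh, 0DDh
L2 = struct.pack("<I", 0xAABBCCDD)       # L2 dd 0AABBCCDDh

eax_from_L1 = int.from_bytes(L1, "little")   # mov eax, [L1]
eax_from_L2 = int.from_bytes(L2, "little")   # mov eax, [L2]

if __name__ == "__main__":
    print("L1 in memory:", L1.hex(" "), "-> eax =", hex(eax_from_L1))
    print("L2 in memory:", L2.hex(" "), "-> eax =", hex(eax_from_L2))
```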
Moving immediate values
Consider the instruction: mov [L], 1
The assembler will give us an error: “operation
size not specified”!
This is because the assembler has no idea
whether we mean for “1” to be 01h, 0001h,
00000001h, etc.
Again, labels have no type
Therefore the assembler provides us with a
way to specify the size of immediate operands
mov dword [L], 1
4-byte double-word
5 size specifiers: byte, word, dword, qword,
tword
Size Specifier Examples
mov [L1], 1 ; Error
mov byte [L1], 1 ; 1 byte
mov word [L1], 1 ; 2 bytes
mov dword [L1], 1 ; 4 bytes
mov [L1], eax ; 4 bytes
mov [L1], ax ; 2 bytes
mov [L1], al ; 1 byte
mov eax, [L1] ; 4 bytes
mov ax, [L1] ; 2 bytes
mov ax, 12 ; 2 bytes
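To see what each size specifier changes in memory, the sketch below writes the immediate 1 into four bytes that start out as FF. The function mov_mem_imm is a made-up model of "mov SIZE [addr], value", not an assembler feature, and it raises an error when no size is given to mirror the message shown earlier.

```python
SIZES = {"byte": 1, "word": 2, "dword": 4}

def mov_mem_imm(mem: bytearray, addr: int, value: int, size=None) -> None:
    """Model `mov SIZE [addr], value` for an immediate value."""
    if size is None:
        # Nothing tells us how many bytes to write.
        raise ValueError("operation size not specified")
    n = SIZES[size]
    mem[addr:addr + n] = value.to_bytes(n, "little")

def demo(size):
    """Return memory (as hex) after storing 1 with the given size specifier."""
    mem = bytearray([0xFF] * 4)
    mov_mem_imm(mem, 0, 1, size)
    return mem.hex(" ")

if __name__ == "__main__":
    for size in ("byte", "word", "dword"):
        print(f"mov {size} [L1], 1 ->", demo(size))
```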
Brackets or no Brackets
mov eax, [L]
Puts the content at address L into eax
Puts 32 bits of content, because eax is a 32-bit
register
mov eax, L
Puts the address L into eax
Puts the 32-bit address L into eax
mov ebx, [eax]
Puts the content at address eax (= L) into ebx
inc eax
Increase eax by one
mov ebx, [eax]
Puts the content at address eax (= L + 1) into ebx
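The distinction between an address and the content at that address can be mimicked with an index into a bytearray. In this sketch the address 4 for label L and the five data bytes are arbitrary values chosen for illustration.

```python
memory = bytearray(16)
L = 4                                                    # hypothetical address of label L
memory[L:L + 5] = bytes([0x11, 0x22, 0x33, 0x44, 0x55])  # illustrative content

def load32(addr: int) -> int:
    """Read 4 bytes at addr as a little-endian double word."""
    return int.from_bytes(memory[addr:addr + 4], "little")

content = load32(L)      # mov eax, [L]   : the content at address L
eax = L                  # mov eax, L     : the address itself
ebx_first = load32(eax)  # mov ebx, [eax] : content at address L
eax += 1                 # inc eax
ebx_second = load32(eax) # mov ebx, [eax] : content at address L + 1

if __name__ == "__main__":
    print(hex(content), eax, hex(ebx_first), hex(ebx_second))
```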
Example
first db 00h, 04Fh, 012h, 0A4h
second dw 165
third db "adf"

mov eax, first
inc eax
mov ebx, [eax]
mov [second], ebx
mov byte [third], 11o

What is the content of “data” memory after the code executes?
Example: Solution
Memory content after each step, byte per byte, in hex (eax and ebx start at 00000000):
Initially: first = 00 4F 12 A4, second = A5 00, third = 61 64 66
mov eax, first: eax = address of first (addresses are 32-bit); memory unchanged
inc eax: eax = address of first + 1; memory unchanged
mov ebx, [eax]: ebx = A5A4124F (the bytes 4F 12 A4 A5, read as a little-endian double word)
mov [second], ebx: first = 00 4F 12 A4, second = 4F 12, third = A4 A5 66
mov byte [third], 11o: first = 00 4F 12 A4, second = 4F 12, third = 09 A5 66
Assembly is Dangerous
Although the previous example is really a
terrible program, it’s a good demonstration of
how the assembly programmer must be really
careful
For instance, we were able to store 4 bytes into
a 2-byte label, thus overwriting the first 2
characters of a string that merely happened to
be stored in memory next to that 2-byte label
Playing such tricks can lead to very clever
programs that do things that would be
impossible (or very cumbersome) to do with a
high-level programming language (e.g., in Java)
But you really must know what you’re doing
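The overwrite described above can be replayed in Python, with a bytearray as the data segment and plain integers as registers. This is a sketch of the example's five instructions only; the declare helper and the labels dictionary are made up for illustration.

```python
mem = bytearray()
labels = {}

def declare(name: str, data: bytes) -> None:
    """Append data to the segment and remember where the label points."""
    labels[name] = len(mem)
    mem.extend(data)

declare("first", bytes([0x00, 0x4F, 0x12, 0xA4]))   # first db 00h, 04Fh, 012h, 0A4h
declare("second", (165).to_bytes(2, "little"))      # second dw 165
declare("third", b"adf")                            # third db "adf"

eax = labels["first"]                               # mov eax, first
eax += 1                                            # inc eax
ebx = int.from_bytes(mem[eax:eax + 4], "little")    # mov ebx, [eax]
addr = labels["second"]
mem[addr:addr + 4] = ebx.to_bytes(4, "little")      # mov [second], ebx (4 bytes!)
mem[labels["third"]] = 0o11                         # mov byte [third], 11o

if __name__ == "__main__":
    print("ebx =", hex(ebx))
    print("memory:", mem.hex(" "))
```

Nothing stops the 4-byte store from spilling past the 2-byte label into the string, exactly as on the real machine.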
In-Class Exercise
Consider the following program
var1 dd 179
var2 db 0A3h, 017h, 012h
var3 db "bca"

mov eax, var1
add eax, 3
mov ebx, [eax]
add ebx, 5
mov [var1], ebx

What is the layout of memory starting at address var1?
In-Class Exercise: Solution
Memory content after each step, byte per byte, in hex:
Initially: var1 = B3 00 00 00, var2 = A3 17 12, var3 = 62 63 61
mov eax, var1: eax = address of var1; memory unchanged
add eax, 3: eax = address of var1 + 3; memory unchanged
mov ebx, [eax]: ebx = 1217A300 (the bytes 00 A3 17 12, read as a little-endian double word)
add ebx, 5: ebx = 1217A305
mov [var1], ebx: var1 = 05 A3 17 12, var2 = A3 17 12, var3 = 62 63 61
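An answer to this exercise can be verified by replaying the five instructions on a bytearray. In this sketch var1 is placed at offset 0, which is an arbitrary choice, and the 32-bit mask after the addition models the register width.

```python
# var1 dd 179 ; var2 db 0A3h, 017h, 012h ; var3 db "bca"
mem = bytearray((179).to_bytes(4, "little") + bytes([0xA3, 0x17, 0x12]) + b"bca")
var1 = 0                                            # hypothetical address of var1

eax = var1                                          # mov eax, var1
eax += 3                                            # add eax, 3
ebx = int.from_bytes(mem[eax:eax + 4], "little")    # mov ebx, [eax]
ebx = (ebx + 5) & 0xFFFFFFFF                        # add ebx, 5
mem[var1:var1 + 4] = ebx.to_bytes(4, "little")      # mov [var1], ebx

if __name__ == "__main__":
    print("ebx =", hex(ebx))
    print("memory:", mem.hex(" "))
```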
