
RAID Disk Storage Dependability Overview

The document discusses storage systems, specifically disks and disk arrays. It begins by providing background on disks and their components like sectors and tracks. It then covers disk arrays like RAID levels 0 through 6, which provide data redundancy to protect against disk failures. The key points are that RAID distributes data across multiple disks for redundancy, and higher RAID levels can tolerate more disk failures. Dependability is discussed in terms of faults, errors, and failures - where faults can cause latent errors and failures occur when errors impact system behavior.

Uploaded by

sdhanesh84


Storage Systems
Disk, RAID, Dependability

Kai Bu
kaibu@[Link]
[Link]
1960s – 1980s
Computing Revolution
1990 –
Information Age
Communication
Computation
Storage
Storage
requires a higher standard of
dependability
than the rest of the computer
• Computation: program crash
• Storage: data loss
Memory Hierarchy

temporary storage

permanent storage
Communication

Storage
magnetic disks dominate
Preview
• Disk
• Disk Array: RAID
• Dependability: Fault, Error, Failure
let’s start from a single disk
Disk
Disk: wiki
[Figure labels: track, track sector, geometrical sector, cluster]
• Sector
minimum storage unit;
a block may span multiple sectors
• Cluster
(dis)contiguous groups of sectors
to reduce the overhead of managing
on-disk data structures;
may span more than one track
Disk: locate data

CHS index
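The slide only names the CHS index. As a minimal sketch, assuming CHS stands for the usual cylinder-head-sector addressing and that sectors are numbered from 1, the conversion to and from a flat block number could look like this. The function names and the geometry values (16 heads, 63 sectors per track) are illustrative assumptions, not taken from the slides.

```python
def chs_to_lba(cylinder, head, sector, heads_per_cylinder, sectors_per_track):
    """Map a (cylinder, head, sector) triple to a flat block number.

    Assumes cylinders and heads count from 0 and sectors count from 1.
    """
    if not 0 <= head < heads_per_cylinder:
        raise ValueError("head out of range")
    if not 1 <= sector <= sectors_per_track:
        raise ValueError("sector out of range")
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


def lba_to_chs(lba, heads_per_cylinder, sectors_per_track):
    """Inverse mapping: flat block number back to (cylinder, head, sector)."""
    cylinder, rest = divmod(lba, heads_per_cylinder * sectors_per_track)
    head, sector_index = divmod(rest, sectors_per_track)
    return cylinder, head, sector_index + 1


# Hypothetical geometry: 16 heads, 63 sectors per track.
first_block_of_cylinder_1 = chs_to_lba(1, 0, 1, 16, 63)
```

The first sector of cylinder 1 comes right after the 16 x 63 sectors of cylinder 0, which is why the cylinder number is multiplied by both geometry parameters.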
Disk Capacity
• Areal Density
= bits/inch^2
= (tracks/inch on a disk surface) x (bits/inch on a track)
in 2011, the highest density:
400 billion bits/inch^2
• Cost per gigabyte
between 1983 and 2011,
improved by almost
a factor of 1,000,000
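A small worked example of the two figures above, as a sketch. The track and bit densities passed in are hypothetical values chosen only so that their product is 400 billion bits/inch^2; they are not real drive specifications. The second function turns the factor of 1,000,000 over 1983 to 2011 into an average yearly factor, assuming steady compound improvement.

```python
def areal_density(tracks_per_inch, bits_per_inch_on_track):
    """Areal density in bits per square inch."""
    return tracks_per_inch * bits_per_inch_on_track


def average_yearly_factor(total_factor, years):
    """Yearly improvement factor that compounds to total_factor over `years`."""
    return total_factor ** (1 / years)


# Hypothetical densities whose product is 400 billion bits per square inch.
density = areal_density(200_000, 2_000_000)

# Cost per gigabyte: a factor of 1,000,000 over the 28 years from 1983 to 2011.
yearly = average_yearly_factor(1_000_000, 2011 - 1983)
print(f"{density:,} bits/inch^2, about {yearly:.2f}x cheaper per year")
```

Under the steady-compounding assumption, the yearly factor comes out at roughly 1.6, so cost per gigabyte fell by more than a third each year on average.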
Disk vs DRAM

Cost DRAM >> DISK


Access time DRAM << DISK
Disk’s Competitor
• Flash Memory
non-volatile semiconductor memory;
about the same bandwidth as disks;
100 to 1000 times lower latency;
15 to 25 times higher cost/gigabyte than disks;
• Wear out
limited to 1 million writes
• Popular in cell phones,
but not in desktop and server
Disk Power
• Power by disk motor
≈ Diameter^4.6 x RPM^2.8 x No. of platters
RPM: Revolutions Per Minute (rotation speed)
• Smaller platters, slower rotation, and
fewer platters reduce disk motor power
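To see how strongly each term matters, the proportionality above can be evaluated for two configurations and compared. This is only a sketch: the function returns a relative number, not watts, and the drive parameters below (3.5 inch, 7200 RPM, 4 platters) are hypothetical.

```python
def relative_motor_power(diameter_inches, rpm, platters):
    """Relative motor power: Diameter^4.6 x RPM^2.8 x number of platters."""
    return diameter_inches ** 4.6 * rpm ** 2.8 * platters


# Hypothetical baseline drive.
baseline = relative_motor_power(3.5, 7200, 4)

# Same drive spinning at half the speed.
half_speed = relative_motor_power(3.5, 3600, 4)

# Same speed and platter count, but smaller 2.5 inch platters.
smaller = relative_motor_power(2.5, 7200, 4)

print(f"half speed: {half_speed / baseline:.2f} of baseline power")
print(f"2.5 inch:   {smaller / baseline:.2f} of baseline power")
```

Because of the 2.8 exponent, halving the rotation speed leaves roughly one seventh of the motor power, while the platter count only scales it linearly.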
what if one is not enough…
disk failure
all or nothing
Disk Arrays
• Disk arrays with redundant disks to
tolerate faults
• If a single disk fails, the lost
information is reconstructed from
redundant information
• Striping: simply spreading data over
multiple disks
• RAID: redundant array of
inexpensive/independent disks
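Striping as described above can be sketched as a round-robin placement of logical blocks. The block-level granularity and the function names are assumptions made for illustration; real arrays may stripe at bit, byte, or multi-sector granularity.

```python
def stripe(blocks, num_disks):
    """Spread logical blocks over the disks in round-robin order."""
    disks = [[] for _ in range(num_disks)]
    for index, block in enumerate(blocks):
        disks[index % num_disks].append(block)
    return disks


def locate(block_index, num_disks):
    """Return (disk, offset on that disk) for a logical block index."""
    return block_index % num_disks, block_index // num_disks


layout = stripe(list("ABCDEFG"), 3)
# Disk 0 holds A, D, G; disk 1 holds B, E; disk 2 holds C, F.
```

Consecutive blocks land on different disks, so a large sequential access can be served by all disks in parallel. Striping alone adds no redundancy.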
RAID
RAID 0
• JBOD: just a bunch of disks
• No redundancy
• No failure tolerated
• Measuring stick for other RAID levels:
cost, performance, and dependability
RAID 1
• Mirroring or Shadowing
• Two copies for every piece of data
• one logical write = two physical writes
• 100% capacity/space
overhead
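A toy model of mirroring, assuming each disk is just an in-memory dictionary from block number to data. The class and method names are hypothetical. It shows the two properties listed above: every logical write costs two physical writes, and a read still succeeds after one copy is lost.

```python
class Mirror:
    """Two copies of every block, as in RAID 1."""

    def __init__(self):
        self.copies = [{}, {}]      # one dictionary per physical disk
        self.physical_writes = 0

    def write(self, block, data):
        # One logical write becomes one physical write per copy.
        for copy in self.copies:
            if copy is not None:
                copy[block] = data
                self.physical_writes += 1

    def fail(self, disk):
        # Simulate losing a whole disk.
        self.copies[disk] = None

    def read(self, block):
        # Serve the read from the first surviving copy.
        for copy in self.copies:
            if copy is not None:
                return copy[block]
        raise IOError("both copies lost")


mirror = Mirror()
mirror.write(0, b"payload")
mirror.fail(0)
survivor = mirror.read(0)
```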

RAID 2
• Each bit of a data word is written to a
separate data disk drive
• Each data word has its (Hamming Code) ECC
word recorded on the ECC disks
• On read, the ECC verifies correct data
or corrects single-disk errors
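The slides do not say which Hamming code is used. As a sketch, assume a Hamming(7,4) code: 4 data bits go to 4 data disks and 3 check bits go to 3 ECC disks, one bit per disk. The function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    code = [0] * 8                                  # index 0 is unused
    code[3], code[5], code[6], code[7] = data_bits
    code[1] = code[3] ^ code[5] ^ code[7]           # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]           # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]           # covers positions 4, 5, 6, 7
    return code[1:]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    syndrome = 0
    for position, bit in enumerate(codeword, start=1):
        if bit:
            syndrome ^= position
    fixed = list(codeword)
    if syndrome:                                    # syndrome is the bad position
        fixed[syndrome - 1] ^= 1
    return [fixed[2], fixed[4], fixed[5], fixed[6]]


word = [1, 0, 1, 1]
on_disks = hamming74_encode(word)     # one bit per disk, 7 disks in total
on_disks[4] ^= 1                      # the disk holding position 5 returns a bad bit
restored = hamming74_decode(on_disks)
```

For a valid codeword the XOR of the positions of all 1 bits is zero, so a nonzero syndrome directly names the disk whose bit is wrong.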
RAID 3
• Data striped over all data disks
• Parity of a stripe written to a parity disk
• Requires at least 3 disks to implement
RAID 3
• Even Parity
parity bit P makes
the # of 1s even
• p = sum(data) mod 2
      P
1 1 1 1
0 1 0 1
1 0 1 0
0 0 0 0
0 1 0 1
0 1 0 1
1 0 1 0
1 1 1 1
• Recovery
if a disk fails,
"subtract" good data
from p of good blocks;
what remains is missing data;
"subtract"
1 - 1 = 0
1 - 0 = 1
0 - 1 = 1
0 - 0 = 0
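The even-parity rule and the "subtract" recovery can be checked on the bit rows shown on the slide. This sketch assumes each row is one stripe over three data disks followed by the parity disk P, which is an interpretation of the figure. Modulo 2, subtracting is the same as adding, so the missing bit is the sum of the surviving bits mod 2.

```python
# Each row is one stripe: three data bits, then the parity bit P (assumed layout).
STRIPES = [
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
]


def even_parity(data_bits):
    """Parity bit that makes the total number of 1s even."""
    return sum(data_bits) % 2


def recover(stripe, failed_disk):
    """Rebuild the bit of the failed disk from the surviving bits of the stripe."""
    surviving = [bit for disk, bit in enumerate(stripe) if disk != failed_disk]
    return sum(surviving) % 2


# Lose the second disk (index 1) and rebuild its column.
rebuilt = [recover(stripe, 1) for stripe in STRIPES]
```

The same function rebuilds the parity disk itself if that is the one that fails, because the parity bit is just another member of the even-sum group.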
RAID 4
• Favors small accesses
• Allows each disk to perform
independent reads, using sectors' own
error checking
• Independent read: served by a single disk,
not read across multiple disks
RAID 3 & RAID 4
bottleneck:
single parity disk

access:
parallel vs independent
RAID 5
• Distributes the parity info across all
disks in the array
• Removes the bottleneck of a single
parity disk found in RAID 3 and RAID 4
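One way to distribute parity is to rotate the parity block one disk to the left on every stripe. The rotation order and the function names here are assumptions for illustration; the slides do not fix a particular layout, and real implementations offer several.

```python
def parity_disk(stripe, num_disks):
    """Disk holding the parity block of a stripe (rotates right to left)."""
    return (num_disks - 1 - stripe) % num_disks


def locate(block, num_disks):
    """Return (disk, stripe) for a logical data block, skipping the parity disk."""
    stripe, index = divmod(block, num_disks - 1)
    parity = parity_disk(stripe, num_disks)
    disk = index if index < parity else index + 1
    return disk, stripe


# With 4 disks, parity moves 3, 2, 1, 0 and then wraps around.
rotation = [parity_disk(stripe, 4) for stripe in range(5)]
```

Over any run of as many stripes as there are disks, each disk holds exactly one parity block, so parity updates are spread evenly instead of queuing on one disk.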
RAID 6: Row-diagonal Parity
• RAID-DP
Recover from two failures

⊕: xor
row: 00 ⊕ 11 ⊕ 22 ⊕ 33 = r4
diagonal: 01 ⊕ 11 ⊕ 31 ⊕ r1 = d1
Double-Failure Recovery
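A sketch of row-diagonal parity and double-failure recovery, under stated assumptions: four data disks (0 to 3), a row-parity disk (4), a diagonal-parity disk (5), four rows per stripe, and the block at (row, disk) assigned to diagonal (row + disk) mod 5, with diagonal 4 left unstored. Reading the slide's equations as "disk, diagonal" pairs is an interpretation. Recovery repeatedly picks any row or diagonal with exactly one missing block and rebuilds it by XOR. All names below are hypothetical.

```python
from functools import reduce
from itertools import combinations
from operator import xor
import random

P = 5                # prime that fixes the geometry (assumed: 4 data disks)
ROWS = P - 1         # rows per stripe
DIAG_PARITY = P      # disk 5; disk 4 is row parity, disks 0..3 hold data


def equations():
    """Groups of (row, disk) cells whose XOR must be zero."""
    # Row groups: the data blocks of a row plus its row-parity block.
    groups = [[(r, disk) for disk in range(P)] for r in range(ROWS)]
    # Diagonal groups: diagonals 0..P-2 are stored, diagonal P-1 is not.
    for d in range(P - 1):
        cells = [(r, disk) for r in range(ROWS) for disk in range(P)
                 if (r + disk) % P == d]
        groups.append(cells + [(d, DIAG_PARITY)])
    return groups


def solve(stripe):
    """Fill None cells using any group that has exactly one unknown."""
    stripe = [row[:] for row in stripe]
    progress = True
    while progress:
        progress = False
        for cells in equations():
            missing = [(r, disk) for r, disk in cells if stripe[r][disk] is None]
            if len(missing) == 1:
                r, disk = missing[0]
                known = (stripe[i][j] for i, j in cells if (i, j) != (r, disk))
                stripe[r][disk] = reduce(xor, known, 0)
                progress = True
    return stripe


def encode(data):
    """data[row] holds 4 data blocks; both parity blocks are computed."""
    return solve([list(row) + [None, None] for row in data])


def recover(stripe, failed_disks):
    """Erase the blocks of the failed disks, then rebuild them."""
    erased = [[None if disk in failed_disks else block
               for disk, block in enumerate(row)] for row in stripe]
    return solve(erased)


rng = random.Random(0)
data = [[rng.randrange(256) for _ in range(P - 1)] for _ in range(ROWS)]
stripe = encode(data)
for pair in combinations(range(P + 1), 2):
    assert recover(stripe, set(pair)) == stripe
```

The final loop fails every pair of the six disks in turn and checks that the stripe is rebuilt exactly. The alternation between a diagonal step and a row step is the chain that the recovery slides walk through.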
RAID: Further Readings
• Raid Types – Classifications
[Link]

[Link]
calculator/[Link]
• RAID
JetStor
[Link]
• More error detection and recovery
schemes
• Reed Solomon coding:
[Link]
v=jgO09opx56o
• Erasure coding, etc.
When are disks dependable
and when are they not?
Dependability
• Computer system dependability is the
quality of delivered service such that
reliance can justifiably be placed on
this service.
• The service delivered by a system is
its observed actual behavior as
perceived by other system(s)
interacting with this system’s users.
• Each module also has an ideal
specified behavior, where a service
specification is an agreed description
of the expected behavior.
Failure
• A system failure occurs when the
actual behavior deviates from the
specified behavior.
Error, Fault
• The failure occurred because of an
error, a defect in that module.
• The cause of an error is a fault.

• When a fault occurs, it creates a latent
error, which becomes effective when
it is activated;
• When the error actually affects the
delivered service, a failure occurs.
Fault, Error, Failure
• A fault creates one or more latent
errors
• An effective error is either a formerly
latent error in that component or one that
has propagated from another error in that
component or from elsewhere
• A component failure occurs when the
error affects the delivered service
Fault
→ creates one or more latent errors
Error
→ activated to be effective
→ affects the delivered service
Failure
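The chain from fault to failure can be modeled as a small state machine. This is only an illustrative sketch: the state names follow the slides, while the class and method names are hypothetical.

```python
from enum import Enum


class State(Enum):
    FAULT_FREE = "fault-free"
    LATENT_ERROR = "latent error"
    EFFECTIVE_ERROR = "effective error"
    FAILURE = "failure"


class Component:
    """Tracks how far a fault has progressed toward a failure."""

    def __init__(self):
        self.state = State.FAULT_FREE

    def fault(self):
        # A fault creates a latent error.
        if self.state is State.FAULT_FREE:
            self.state = State.LATENT_ERROR

    def activate(self):
        # A latent error becomes effective once it is activated.
        if self.state is State.LATENT_ERROR:
            self.state = State.EFFECTIVE_ERROR

    def affect_service(self):
        # An effective error that reaches the delivered service is a failure.
        if self.state is State.EFFECTIVE_ERROR:
            self.state = State.FAILURE


disk = Component()
disk.affect_service()          # nothing is wrong yet, so nothing happens
state_before_fault = disk.state
disk.fault()
disk.activate()
disk.affect_service()
```

Each transition only fires from the preceding state, which mirrors the point of the slides: a fault alone is not a failure until its error is activated and reaches the delivered service.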
Categories of Faults by Cause
• Hardware faults
failed devices
• Design faults
usually in software design;
occasionally in hardware design;
• Operation faults
mistakes by operations and maintenance
personnel;
• Environmental faults
fire, flood, earthquake, power failure,
sabotage;
Categories of Faults by Duration

• Transient faults
exist for a limited time and are not
recurring;
• Intermittent faults
cause a system to oscillate between
faulty and fault-free operation
• Permanent faults
do not correct themselves with the
passing of time;
Example: Berkeley’s
Tertiary Disk
Example: Tandem
Appendix D.1–D.3
[Link]
?
Thank You
don’t stop believin’

there’s no in-between,
take it to extreme.
#What’s More
• Don’t Stop Believin’ / All Or Nothing /
Loser Like Me / I Lived /
• Wake Me Up / Stand / Brave /
Defying Gravity / Breakaway /
Roots Before Branches / Not The End
