
Sashikumaar Ganesan

AI Mastery: The Scalable AI & ML Handbook

Sashikumaar Ganesan
Department of Computational and Data Sciences
Indian Institute of Science, Bangalore
September 2023

Accelerating Intelligent
Learning Through Advanced
Parallel Computing


Contents

1 Introduction
2 The Evolution of Computer Architecture
2.1 History of Computers
2.2 Measuring the Growth and Speed of Computers
2.3 Moore’s Law and Dennard Scaling
2.4 Timeline of Hardware Development
2.5 Changing Pace of Hardware Development
2.6 Class of Computers
2.7 Birth of Parallel Computing
3 Parallel Computing and Memory Hierarchies
3.1 Overview of Computer Architecture
3.2 Memory Hierarchies
3.3 Shared vs Distributed Memory
3.4 Cache Optimisation & Techniques
3.5 Application to Machine Learning: Data loading and pre-processing
4 Mastering Shared Memory, GPU Computing, and Advanced Implementations
5 Distributed Dynamics: Exploring MPI and Parallel Implementations in ML
6 Scalable Frontiers: Navigating Parallel Libraries and Algorithms in ML


Copyright Policy

Zenteiq Edtech Private Limited
Bangalore 560015 INDIA

Introduction
This copyright policy applies to the digital textbook (the "Textbook")
created and distributed by Zenteiq Edtech Private Limited ("Zenteiq,"
"we", "our," or "us"). The purpose of this policy is to inform users of
the restrictions and allowances made under the copyright law and the
rights reserved by Zenteiq Edtech Private Limited.

Copyright Ownership
All rights, including copyright and intellectual property rights in the
Textbook, are owned by Zenteiq Edtech Private Limited. All rights
reserved.

Usage Rights
The Textbook is provided for personal, non-commercial use only. Users
are permitted to download and use the textbook for educational pur-
poses. Any reproduction, redistribution, or modification of the text-
book, or any part of it, other than as expressly permitted by this policy,
is strictly prohibited.

Prohibited Uses
Unless expressly permitted by us in writing, users may not:

• Reproduce, distribute, display, or transmit the Textbook or any part thereof in any form or by any means.

• Modify, translate, adapt, or create derivative works from the textbook.

• Remove any copyright or other proprietary notation from the textbook.

• Commercially exploit the textbook.

Permissions
Any person wishing to use the Textbook in a manner not expressly
permitted by this policy should contact Zenteiq Edtech Private Limited
to request permission. Permissions are granted at the sole discretion
of Zenteiq Edtech Private Limited.


Legal Action
Infringement of the rights in relation to the Textbook will lead to legal
action, which may include claims for damages, injunctions and/or
recovery of legal costs.

Amendments
Zenteiq Edtech Private Limited reserves the right to amend this copy-
right policy at any time without notice. The revised policy will be ef-
fective immediately upon posting on the relevant platform, and users
are deemed to be aware of and bound by any changes to the copyright
policy upon publication.

Contact Information
Any enquiries regarding this copyright policy, requests for permission
to use the Textbook, or any related questions should be directed to:

Zenteiq Edtech Private Limited
Bangalore 560015
Email: [email protected]
Web: www.zenteiq.com

Notice
This Textbook is protected by copyright law and international treaties.
Unauthorised reproduction or distribution of this textbook, or any por-
tion of it, may result in severe civil and criminal penalties, and will be
prosecuted to the maximum extent possible under the law.

©2023 Zenteiq Edtech Private Limited. All Rights Reserved


Architectural Evolution: The Journey of Computing

1 Introduction

In the current data-driven world, Machine Learning (ML) is a key technology
that is used in a variety of applications, from self-driving cars
to financial trading systems. The sheer amount of data and the com-
plexity of ML algorithms require a great deal of computing power and
memory. Therefore, there is a need for more efficient methods of data
processing and analysis. Parallel computing is a solution to this prob-
lem, as it involves breaking down computational tasks into smaller
parts that can be processed at the same time, thus reducing the time
needed to complete the entire task.
The significance of parallel computing in the realm of machine learn-
ing cannot be overstated. First, it allows the management of large
datasets, a necessity for the training of precise ML models. As datasets
become larger, the need for computational power and memory in-
creases. Parallel computing solves this issue by distributing data and
computations across multiple processors, thus making it possible to
process large amounts of data quickly. Secondly, it accelerates compu-
tations, which is essential for tasks that involve repetitive operations, a
common trait of ML algorithms. By executing these operations simul-
taneously, parallel computing significantly reduces the training and
optimisation time of ML models. Furthermore, parallel computing in-
creases the energy efficiency of the overall system by spreading the
computational load across multiple processors or nodes. This is es-
pecially important for edge devices, which often have limited power
resources.
The use of parallel computing in Machine Learning is widespread
and varied. Deep learning, a subset of ML, often requires the train-
ing of neural networks with a large number of parameters on a large
amount of data. Parallel computing makes it possible to divide this
workload among multiple Graphical Processing Units (GPUs) or Cen-
tral Processing Units (CPUs), allowing for the training of large models
in a reasonable amount of time. Similarly, Natural Language Process-
ing (NLP) tasks, such as machine translation and sentiment analysis,
require the processing of a great deal of text data. Parallel computing
can speed up the preprocessing, feature extraction, and model train-
ing phases of these tasks. Computer Vision, another important area of
ML, involves computationally intensive operations, such as object de-
tection and image classification. These operations can be parallelised
to improve performance and enable real-time processing.
Parallel computing is essential for edge computing and federated
learning. Edge computing requires the optimisation of computations
to be executed quickly on devices with limited computational power.
Parallel computing is the answer to this problem, allowing for real-
time processing and decision making at the edge. Federated learning,
on the other hand, involves training models across multiple devices or
nodes while keeping the data localised. Parallel computing is benefi-
cial in this situation, as it enables the coordination and combination of
model updates from different nodes, allowing for the efficient training
and optimisation of the model without compromising data privacy.
To sum up, parallel computing is a key technology that enhances
the performance, speed, and energy efficiency of ML applications. It
is a response to the growing amount of data and the intricacy of ML
algorithms, allowing the production of real-time, energy-efficient, and
secure ML applications. This book is intended to provide a thorough
understanding of the principles and applications of parallel computing
in ML, including edge computing and federated learning. It will dis-
cuss the theoretical foundations, practical implementations, and real-
world applications of parallel computing in ML, equipping readers
with the knowledge and abilities needed to create and optimise ML
applications using parallel computing techniques.


2 The Evolution of Computer Architecture

Why HPC?

How HPC?

Figure: Dot product with two CPUs.
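To make the idea in the figure concrete, here is a minimal sketch of a dot product split across two worker processes, assuming only the standard multiprocessing module and numpy; it is an illustration, not the book's reference implementation.

import multiprocessing
import numpy as np

def partial_dot(x, y, queue):
    # Each worker computes the dot product of its half of the vectors.
    queue.put(float(np.dot(x, y)))

if __name__ == "__main__":
    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)
    mid = len(a) // 2
    queue = multiprocessing.Queue()

    # One process per half of the data -- conceptually, one per CPU.
    p1 = multiprocessing.Process(target=partial_dot, args=(a[:mid], b[:mid], queue))
    p2 = multiprocessing.Process(target=partial_dot, args=(a[mid:], b[mid:], queue))
    p1.start()
    p2.start()
    p1.join()
    p2.join()

    # The two partial results sum to the full dot product.
    print(queue.get() + queue.get())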


2.1 History of Computers


The history of computers is a fascinating journey that spans several
centuries. It encompasses a wide range of devices, from ancient me-
chanical calculators to modern-day supercomputers. This section will
provide an overview of the key milestones in the evolution of comput-
ers and discuss notable figures in the history of computing.

Ancient and Mechanical Computers


The history of computing can be traced back to ancient times when me-
chanical devices were used for calculations. The abacus, which dates
back to around 2400 BC, is one of the earliest known computing de-
vices. It consisted of beads that were moved along a series of rods or
wires to perform arithmetic calculations.
In the 17th century, the French mathematician and philosopher Blaise
Pascal developed the Pascaline, a mechanical calculator capable of per-
forming addition and subtraction. A few decades later, the German
mathematician and philosopher Gottfried Wilhelm Leibniz improved
upon Pascal’s design by creating a calculator that could also perform
multiplication and division.

The Birth of Electronic Computers


The development of electronic computers marked a significant mile-
stone in the history of computing. During the 1930s and 1940s, sev-
eral electronic computers were developed independently in different
parts of the world. One of the earliest electronic computers was the
Atanasoff-Berry Computer (ABC), developed by John Atanasoff and
Clifford Berry at Iowa State College (now Iowa State University) be-
tween 1937 and 1942. Although the ABC was never fully operational,
it laid the groundwork for future computer development.
The Electronic Numerical Integrator and Computer (ENIAC), de-
veloped by John Mauchly and J. Presper Eckert at the University of
Pennsylvania, is often considered the first general-purpose electronic
digital computer. Completed in 1945, ENIAC was initially designed to
calculate artillery firing tables for the U.S. Army, but it was later used
for a variety of scientific and engineering calculations.

The Advent of Stored-Programme Computers


The concept of a computer with a stored programme, where both
data and instructions are stored in the same memory, was a revolu-
tionary development in the history of computing. The idea was first
proposed by Alan Turing and John von Neumann independently in
the 1930s and 1940s.


The first operational stored-programme computer was the Manch-
ester Baby, developed at the University of Manchester in the United
Kingdom. It successfully executed its first programme on 21 June
1948. This achievement paved the way for the development of more
advanced stored-programme computers, such as the Electronic Delay
Storage Automatic Calculator (EDSAC) at the University of Cambridge
and the Ferranti Mark I at the University of Manchester.

The Rise of Personal Computers


The development of personal computers in the 1970s and 1980s democra-
tised computing and made it accessible to the general public. The
MITS Altair 8800, released in 1975, is often considered the first per-
sonal computer. It was followed by other influential personal com-
puters, such as the Apple I and II, the IBM PC, and the Commodore
64.
These computers revolutionised the way people interact with tech-
nology, enabling them to use computers for a wide range of applica-
tions, from word processing to gaming.

Notable Figures in the History of Computing


Several individuals made significant contributions to the development
of computing. Some notable figures include:
- Charles Babbage: An English mathematician and engineer, Bab-
bage is often referred to as the "father of the computer" for his work
on the design of the Analytical Engine, a mechanical general-purpose
computer that was never built.
- Ada Lovelace: An English mathematician and writer, Lovelace is
considered the world’s first computer programmer for her work on
Charles Babbage’s Analytical Engine.
- Alan Turing: An English mathematician, logician, and computer
scientist, Turing is best known for his work on the development of the
Turing machine, a theoretical computing device that laid the ground-
work for modern computer science.
- John von Neumann: A Hungarian-American mathematician and
physicist, von Neumann made significant contributions to the devel-
opment of the stored-program computer and the field of computer
science.

Summary
The history of computers is a story of human ingenuity and inno-
vation. From ancient mechanical devices to modern electronic com-
puters, the development of computing technology has had a profound
impact on society and has shaped the world in which we live today.
The story of computing is far from over, as researchers and engi-
neers continue to push the boundaries of what is possible, developing
new technologies and approaches that will shape the future of com-
puting.

Answer True/False with justification.

• The primary purpose of the earliest computers was general-purpose computing accessible to the general public.

• The move from mechanical to electronic components in computers marked a significant advancement in computational speed and reliability.

• Parallel processing has been a modern development, coming into prominence only after the advent of personal computers.

• Mainframe computers were designed for individual use, similar to modern personal computers.

• The switch from vacuum tubes to transistors marked an improvement in energy efficiency and reliability in computers.

2.2 Measuring the Growth and Speed of Computers

Introduction to Computer Performance Metrics


The performance of a computer is a crucial factor in determining its
efficiency and effectiveness in performing tasks. Over the years, as
technology has advanced, the need for more powerful and faster com-
puters has become increasingly important. This has led to the de-
velopment of various metrics used to measure the performance of a
computer. These metrics include clock speed, instructions per second
(IPS), and floating point operations per second (FLOPS), among others.

Clock Speed
Clock speed, measured in hertz (Hz), is one of the most widely used
metrics to measure computer performance. It refers to the frequency at
which the central processing unit (CPU) of a computer operates. The
higher the clock speed, the more operations a CPU can perform in a
second. However, it is essential to note that a higher clock speed does
not always translate to better overall performance, as it is just one of
many factors that contribute to a computer’s performance. Other fac-
tors, such as CPU architecture, cache size, and instruction set, also play
a crucial role in determining the overall performance of a computer.


Instructions Per Second (IPS)


Instructions per second (IPS) is another important metric used to
measure computer performance. It refers to the number of instruc-
tions that a CPU can execute in one second. IPS is a more comprehen-
sive measure of performance than clock speed, as it takes into account
the efficiency of the CPU architecture and the nature of the instruc-
tions being executed. Different types of instructions, such as simple
arithmetic operations or complex floating-point calculations, can have
a significant impact on the IPS value.

Floating-Point Operations Per Second (FLOPS)


FLOPS is a measure of computer performance that is particularly im-
portant in scientific and engineering applications. It refers to the num-
ber of floating-point operations that a computer can perform in one
second. Floating-point operations are used to represent and manip-
ulate real numbers, which are crucial in applications such as simula-
tions, data analysis, and graphical rendering. FLOPS is calculated by
multiplying the number of cores in a processor by the clock speed and
the number of floating-point operations performed per clock cycle.
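As a rough illustration of that formula, the snippet below computes a theoretical peak using made-up, purely illustrative figures; they do not describe any particular processor.

# Hypothetical figures for illustration only -- not a specific CPU.
cores = 8                 # number of processor cores
clock_hz = 3.0e9          # clock speed: 3.0 GHz
flops_per_cycle = 16      # floating-point operations per core per cycle

peak = cores * clock_hz * flops_per_cycle
print(f"Theoretical peak: {peak / 1e9:.0f} GFLOPS")   # 384 GFLOPS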

Benchmarking
Benchmarking is the process of comparing the performance of a com-
puter or a specific component (such as a processor or a graphics card)
with that of other computers or components. It involves running a
set of standardised tests or applications, known as benchmarks, that
are designed to simulate real-world workloads and measure various
aspects of performance. Benchmarks can be classified into three main
categories: synthetic benchmarks, application benchmarks, and real-
world benchmarks. Synthetic benchmarks are designed to test specific
aspects of performance, such as processing speed or memory band-
width, in isolation. Application benchmarks, on the other hand, in-
volve running actual applications or parts of applications to measure
performance in real-world scenarios. Finally, real-world benchmarks
involve performing tasks that are representative of actual workloads
encountered by users, such as browsing the web, video editing, or
playing games.
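A very small synthetic benchmark can be written in a few lines; the sketch below simply times a fixed NumPy matrix-multiplication workload. The matrix size and the derived GFLOPS figure are illustrative, and real benchmark suites are far more careful about warm-up and repetition.

import time
import numpy as np

n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b                      # the benchmarked workload
elapsed = time.perf_counter() - start

# An n x n matrix multiply needs roughly 2*n**3 floating-point operations.
print(f"{elapsed:.3f} s, about {2 * n**3 / elapsed / 1e9:.1f} GFLOPS achieved")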

Performance Metrics Over Time


Over the years, the way we measure computer performance has evolved.
In the early days of computing, clock speed and IPS were the most
commonly used metrics for measuring performance. However, as tech-
nology advanced and processors became more complex, other metrics,
such as FLOPS and benchmarks, became more important. Addition-
ally, with the advent of multi-core processors, it became crucial to con-
sider not only the performance of individual cores but also the overall
performance of the processor as a whole. This has led to the develop-
ment of new metrics, such as million instructions per second (MIPS)
and teraFLOPS (TFLOPS), which take into account the performance of
multiple cores. Furthermore, as power consumption and thermal effi-
ciency have become increasingly important factors in the design and
operation of computers, metrics such as performance per watt and
performance per square millimetre have gained importance.


Summary
Measuring the growth and speed of computers is a complex task that
involves considering various factors and metrics. Clock speed, IPS,
and FLOPS are some of the most commonly used metrics, but it is im-
portant to consider other factors, such as CPU architecture, cache size,
and instruction set, to get a comprehensive view of a computer’s per-
formance. Additionally, benchmarking is a valuable tool for compar-
ing the performance of different computers and components in real-
world scenarios. Finally, it is essential to consider not only raw per-
formance but also power consumption and thermal efficiency when
evaluating the performance of a computer.
In summary, measuring the growth and speed of computers re-
quires a comprehensive approach that considers multiple factors and
metrics. By doing so, we can better understand the capabilities and
limitations of computers and make more informed decisions about
their design and operation.

Answer True/False with justification.

• Increasing the clock speed of a computer’s processor will always result in a proportionate increase in overall system performance.

• The performance of a computer system can be solely determined by its hardware specifications.

• Benchmarks provide a practical way to compare the performance of different computer systems using real-world tasks.

• All metrics for measuring a computer’s speed and performance are universally applicable to every computational task.

• In general, a higher FLOPS value indicates a more powerful capability for mathematical calculations.

2.3 Moore’s Law and Dennard Scaling

Moore’s Law
Moore’s Law is an observation made by Gordon Moore, cofounder
of Intel, in 1965. He observed that the number of transistors on a
microchip doubled approximately every two years, while the cost of
computers was halved. In its original form, Moore’s law stated that
the transistor density of semiconductor devices would double approx-
imately every year. However, in 1975, Moore revised the law to reflect
a doubling approximately every two years.
This observation became known as Moore’s law and has been in-
credibly accurate in predicting the growth of computing power over
the past several decades. It has been a driving force behind the ex-
ponential increase in computing power, leading to smaller, faster, and
cheaper electronic devices.
However, Moore’s law is not a physical or natural law, but rather
an observation and a projection of a historical trend. In recent years,
the pace of advancement predicted by Moore’s law has started to slow
down due to physical limitations, such as heat dissipation and quan-
tum effects, encountered as transistors approach the nanometre scale.
Despite these challenges, the semiconductor industry has continued
to innovate and find ways to increase computing power through other
means, such as multicore processors and 3D chip stacking.

Dennard Scaling
Dennard Scaling, named after Robert H. Dennard, is another key prin-
ciple that has driven the advancement of microelectronics. In 1974,
Dennard and his colleagues observed that as transistors were made
smaller, their power density remained constant, meaning that the power
consumption per unit area of a chip remained the same as the size of
the transistors decreased. This was because the voltage and current
were scaled down with the dimensions of the transistor.
Dennard scaling enabled the creation of smaller and more power-
efficient microprocessors, which contributed significantly to the de-
velopment of portable electronic devices such as laptops and smart-
phones.

Challenges and Limitations


Both Moore’s Law and Dennard Scaling have faced challenges as tran-
sistor sizes approach the atomic scale. For Moore’s Law, the main chal-
lenges include heat dissipation, quantum effects, and the prohibitive
cost of manufacturing ever-smaller transistors. For Dennard Scaling,
the main challenge is the increase in leakage current as the size of the
transistors decreases, leading to higher power consumption.
These challenges have led to a slowdown in the rate of improve-
ment in both transistor density and power efficiency. As a result, the
semiconductor industry has been exploring alternative approaches to
continue the advancement of computing power. These approaches in-
clude the development of new materials, the use of 3D chip stacking,
and the exploration of novel computing architectures, such as quan-
tum computing and neuromorphic computing.


Summary
Moore’s Law and Dennard Scaling have been fundamental principles
that have guided the development of microelectronics for many decades.
However, as the size of transistors approaches the atomic scale, the
challenges associated with heat dissipation, quantum effects, and leak-
age current have led to a slowdown in the rate of improvement pre-
dicted by these principles. As a result, the semiconductor industry
is exploring alternative approaches to continue the advancement of
computing power, which will be crucial for the development of future
electronic devices and applications.

Answer True/False with justification.

• Moore’s law predicts the economic feasibility of doubling the number of transistors on a chip every two years.

• Dennard Scaling deals with the computational power of chips, while Moore’s law addresses their power consumption.

• The continuation of Moore’s law would suggest that computers will indefinitely become faster while consuming less power.

• One of the implications of Dennard Scaling is that smaller transistors would consume less power individually but maintain the same power density as larger ones.

• The slowing down of Moore’s law means that hardware innovations have completely stagnated.

2.4 Timeline of Hardware Development


The development of computer hardware has seen dramatic changes
over the past several decades. The evolution from mainframes to mod-
ern special-purpose architectures has been marked by key milestones
that have shaped the computing landscape. This timeline highlights
some of the most significant developments in computer hardware.

1945: Mainframes
Mainframe computers, built first from vacuum tubes and later from discrete
transistors, were the first large-scale electronic digital computers. They were used primarily by
large organisations for critical applications, bulk data processing, and
complex calculations.


1970: Minicomputer
Minicomputers, built first from discrete components and later from low-
integration chips, were smaller and less expensive than mainframes. They were
designed to perform specific tasks and used in applications such as
industrial automation and scientific research.

End of 1970s: Microprocessors


The development of microprocessors led to the gradual incorporation
of these small integrated circuits into all machines, revolutionising
computing and enabling the development of personal computers.

Early 1980s: Reduced Instruction Set Computers (RISC)


The introduction of RISC architectures marked a shift toward simpler
instruction sets and more efficient computing. The goal was to achieve
one instruction per clock cycle, increasing the overall performance of
the computers.

Early 2000s: The end of Dennard scaling leads to multicore CPUs


The end of Dennard scaling, which described the miniaturisation of
transistors and its associated benefits, forced the industry to adopt
multicore CPUs to continue improving performance. This shift led to
an emphasis on thread-level parallelism.

Mid 2010s: Special-Purpose Architectures


The mid-2010s saw the rise of special-purpose architectures, such as
general-purpose graphics processing units (GPGPUs) and tensor cores.
These specialised hardware components enabled massive data par-
allelism and low-precision and mixed-precision floating-point operations
(FLOPs) for artificial intelligence (AI) and machine learning (ML) ap-
plications.

2.5 Changing Pace of Hardware Development

Till Mid 1980s: Technology Driven


The pace of hardware development was primarily technology-driven,
with performance doubling approximately every 3.5 years.

Starting 1986: Technology + RISC


The incorporation of RISC architectures led to a faster pace of devel-
opment, with performance doubling approximately every 2 years.


2003 - 2011: End of Dennard scaling


The end of Dennard scaling marked a slowdown in the pace of hard-
ware development, with performance doubling approximately every
3-5 years. This period saw the forced adoption of multicore architec-
tures.

2011 - 2015: Continued slowdown


The pace of hardware development continued to slow down, with per-
formance doubling approximately every 8 years. This period marked
a significant challenge for the computing industry as it sought new
ways to improve performance despite the physical limitations of exist-
ing technology.

Summary
The timeline of hardware development highlights the incredible progress
that has been made in the field of computing over the past several
decades. From the early days of mainframes to the rise of special-
purpose architectures, the evolution of computer hardware has been
marked by continuous innovation and adaptation to new challenges.
As we look to the future, it is clear that the pace of hardware develop-
ment will continue to be influenced by a combination of technological
advancements and architectural innovations.

Answer True/False with justification.

• The development of silicon chips was pivotal because they allowed more
transistors to be packed into a smaller space, increasing computational
power.

• The introduction of SSDs over HDDs was purely an aesthetic choice, with
no impact on system performance.

• GPUs, due to their architecture, are inherently better at tasks that require
parallel processing compared to traditional CPUs.

• Quantum computers, when fully realized, will likely replace all traditional
computers because they are superior in every computational aspect.

• Moving from mechanical to electronic storage mediums generally resulted in faster data access times and increased reliability.

2.6 Class of Computers


The realm of computing has progressed, introducing a variety of com-
puter types created to meet particular objectives and uses. From diminu-
tive, energy-efficient gadgets that power our everyday lives to im-
mense warehouses of machines providing for global data requirements,
let us explore these classifications in detail.

IoT/Embedded Computers
IoT (Internet of Things) refers to the network of physical devices em-
bedded with sensors, software, and other technologies to connect and
exchange data with other systems over the Internet. Embedded com-
puters are specialised systems designed to perform dedicated tasks or
functions within a larger system. They are typically optimised for spe-
cific applications and do not require user intervention.

Characteristics and Usage:

• Compact and Energy Efficient: Due to the nature of their tasks, em-
bedded systems are designed to be compact and consume minimal
power.

• Real-time operation: Many embedded systems operate in real-time, meaning they respond instantly to inputs or changes in the environment.

• Examples: Smart thermostats, wearable fitness trackers, automotive control systems, and smart home devices.

Personal Mobile Devices (PMDs)


Personal mobile devices (PMDs) are capable of providing a variety of
features, such as computing, communication, and multimedia play-
back. These gadgets are powered by a battery and can be conveniently
carried in pockets or small bags.

Characteristics and Usage:

• Portability: Designed to be lightweight and compact.

• Connectivity: Equipped with Wi-Fi, Bluetooth, and cellular data capabilities.

• Examples: Smartphones, tablets, smartwatches, and e-readers.

Desktop Computers
Desktop computers are designed to be used in one place and include
a monitor, keyboard, mouse, and the main processing unit. These ma-
chines are intended for regular use.


Characteristics and Usage:

• Versatility: Suitable for a broad range of tasks, including word processing, gaming, graphic design, and more.

• Upgradeability: Components such as RAM, storage, and graphics cards can often be upgraded.

• Examples: Workstations, personal computers, gaming rigs.

Servers/Workstations
Servers are powerful computers designed to process requests and send
data to other computers over a local network or the Internet. Work-
stations are high-end computers designed for technical or scientific
applications.

Characteristics and Usage:

• High Performance: Equipped with advanced processors, ample RAM, and large storage capacities.

• Reliability: Built to run 24/7, with features such as error checking and redundant power supplies.

• Examples: Data centres, web hosting, scientific simulations.

Clusters/Warehouse-Scale Computers (WSCs)


Groups of computers that are connected and work together to increase
performance or reliability are known as clusters. Warehouse-scale
computers (WSCs) are large data centres that contain servers, storage,
and networking capabilities.

Characteristics and Usage:

• Massive Scale: Designed to cater to millions of users simultaneously, like cloud service providers.

• Redundancy: Multiple backups ensure no single point of failure, offering high availability.

• Examples: Google’s data centres, Amazon Web Services infrastructure, and supercomputing clusters.

Summary
The variety of available computers has opened up a wide range of pos-
sibilities, from individual use to worldwide operations. As technology
continues to develop, we can expect to see more specialised types of
computer being created to meet increasingly specific requirements and
difficulties.

2.7 Birth of Parallel Computing

Introduction to Parallel Computing


Parallel computing is a computing model that allows multiple pro-
cessors or computing units to simultaneously execute multiple tasks
or parts of a single task. This approach is designed to solve large
and complex problems more quickly and efficiently by dividing the
workload across multiple processors or computers. Parallel comput-
ing emerged as a response to the limitations of serial or sequential
computing, where tasks are processed one after another by a single
processor.

Figure: Cache hierarchy and how executions happen.

Figure: Dual-core system.

Development of Parallel Computing


The development of parallel computing can be traced back to the mid-
20th century. The first notable example of parallel computing was
the ILLIAC IV, developed at the University of Illinois in the 1960s.
This computer had 64 processing elements and was one of the first
machines designed to execute multiple instructions simultaneously.
However, the real breakthrough in parallel computing came with
the development of vector processors and the introduction of multi-
core processors. Vector processors, developed by companies like Cray,
allowed multiple operations to be performed simultaneously on large
data sets. In the 1980s and 1990s, companies like Intel, IBM, and
Sun Microsystems started developing multicore processors, which in-
tegrated multiple processor cores on a single chip. This development
marked the beginning of the widespread adoption of parallel comput-
ing in mainstream computing applications.

Parallel Computing Architectures


There are different parallel computing architectures, each with its char-
acteristics and applications. The two main types of parallel computing
architecture are SIMD (Single Instruction, Multiple Data) and MIMD
(Multiple Instruction, Multiple Data).

SIMD (Single Instruction, Multiple Data)


In the SIMD architecture, a single instruction is executed simultane-
ously on multiple data elements by multiple processing units. This ar-
chitecture is suitable for applications where the same operation must
be performed on large data sets, such as in image and signal process-
ing.
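As a rough analogy in Python, NumPy’s vectorised operations apply one logical operation across many data elements at once, and the underlying compiled kernels typically map such element-wise work onto the CPU’s SIMD instructions; this is a sketch of the idea, not a literal demonstration of the hardware.

import numpy as np

# One logical operation ("multiply by 2") applied to every element of a
# large array; the compiled loop underneath can use SIMD instructions so
# that a single instruction processes several values at a time.
data = np.random.rand(10_000_000)
scaled = data * 2.0
print(scaled[:5])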

MIMD (Multiple Instruction, Multiple Data)


In the MIMD architecture, multiple instructions are executed simulta-
neously on multiple data elements by multiple processing units. Each
processing unit can execute a different instruction on different data el-
ements. This architecture is more flexible than SIMD and is suitable
for applications that require more complex computations and control
structures.
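At the process level, the MIMD idea can be loosely illustrated with Python’s multiprocessing module by running different functions on different data at the same time; hardware MIMD operates at the instruction level, so this is only an analogy.

import multiprocessing

def total(numbers):
    # One worker sums its data...
    print("sum =", sum(numbers))

def product(numbers):
    # ...while another runs a different instruction stream on different data.
    result = 1
    for n in numbers:
        result *= n
    print("product =", result)

if __name__ == "__main__":
    p1 = multiprocessing.Process(target=total, args=([1, 2, 3, 4],))
    p2 = multiprocessing.Process(target=product, args=([5, 6, 7],))
    p1.start()
    p2.start()
    p1.join()
    p2.join()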

Challenges and Future Trends


Despite its advantages, parallel computing also comes with challenges.
One of the main challenges is the difficulty of programming paral-
lel applications. Writing parallel programmes requires a deep un-
derstanding of the problem to be solved and the parallel computing
architecture that is used. It also requires the use of specialised pro-
gramming languages and tools.
Another challenge is the scalability of parallel applications. As the
number of processors or computing units increases, the communica-
tion and synchronisation overhead also increases, which can limit the
performance gains obtained from parallel computing.
Despite these challenges, parallel computing continues to be a cru-
cial area of research and development. With the continued growth of
data and the demand for real-time processing, the need for faster and
more efficient computing solutions will only increase. Researchers and
practitioners are working on new parallel computing architectures, al-
gorithms, and tools to overcome current challenges and meet future
computing needs.

Summary
Parallel computing has played a key role in the advancement of com-
puting technology and has enabled us to solve complex problems more
quickly and efficiently. With the continued growth of data and the de-
mand for real-time processing, the importance of parallel computing
will only increase. Despite the challenges associated with parallel com-
puting, ongoing research and development in this area is expected to
lead to new and improved parallel computing solutions that will shape
the future of computing.


3 Parallel Computing and Memory Hierarchies

Parallel computing and memory hierarchies are fundamental concepts
in computer science that have significant implications in various ap-
plications, including machine learning. The objective of this section is
to provide a comprehensive introduction to parallel computing, mem-
ory hierarchies, and their applications in machine learning, specifi-
cally data loading and preprocessing. This section will begin with an
overview of computer architecture, followed by a discussion on mem-
ory hierarchies, shared vs. distributed memory, and cache optimiza-
tion techniques. Additionally, we will explore the application of these
concepts to machine learning, focussing on optimising data loading
and preprocessing. Practical Python examples and assignment prob-
lems will be provided throughout the section to reinforce the concepts
and allow the reader to apply the knowledge in real-world scenarios.

3.1 Overview of Computer Architecture


Computer architecture is the design and organization of the hardware
components in a computer system, including the processor, memory,
input/output devices, and the connections between them. The pro-
cessor, or Central Processing Unit (CPU), is the core of the computer,
responsible for executing instructions. Modern processors often have
multiple cores, each capable of executing instructions independently,
enabling parallel processing.

Python Example:
In Python, the multiprocessing module allows you to create processes
that can run concurrently. Each process runs in its own Python inter-
preter.

Code Listing 1: Python code for multiprocessing


import multiprocessing

def square(n):
    print("The square of", n, "is", n * n)

if __name__ == "__main__":
    # Two processes run the same function on different inputs, concurrently.
    p1 = multiprocessing.Process(target=square, args=(4,))
    p2 = multiprocessing.Process(target=square, args=(6,))

    p1.start()
    p2.start()

    p1.join()
    p2.join()

    print("End of script")

Assignment Problem:
Construct a Python script that uses the multiprocessing module to
compute the factorial of two numbers simultaneously.

3.2 Memory Hierarchies


A computer system’s memory hierarchy consists of multiple storage
levels, each with its own capacity, speed, and cost. At the top of the
hierarchy are registers, which are small storage locations within the
processor that can be accessed quickly. Below the registers is the cache
memory, which is faster than the main memory (RAM), but has less
capacity. Cache memory is often divided into multiple levels (L1, L2,
L3) with varying sizes and speeds. The main memory, or RAM, is
larger and slower than the cache, but faster than the secondary mem-
ory. Secondary memory, such as hard drives or solid-state drives, has
the greatest capacity, but is the slowest.

Figure 1: Memory Hierarchy in a computer.

Python Example
Python does not give you direct access to the memory hierarchy. How-
ever, you can use the numpy library to create and manipulate large
arrays, which will, in turn, involve cache optimisation and memory
hierarchy.

Code Listing 2: Python code for dot product

import numpy as np

# Create a large array
array = np.random.rand(10000, 10000)

# Perform matrix multiplication
result = np.dot(array, array)

Assignment Problem
Create a Python program that generates two large matrices and per-
forms matrix multiplication. Measure the time taken to perform the
multiplication and try to optimize the code to reduce the computation
time.

3.3 Shared vs Distributed Memory


In parallel computing, there are two main memory architectures: shared
memory and distributed memory. In a shared-memory system, all
processors have access to a single shared memory space. This makes
it easy for processors to communicate and share data, but can lead
to contention and scalability issues as the number of processors in-
creases. In a distributed memory system, each processor has its own
private memory, and processors communicate by passing messages.
This eliminates contention and allows for better scalability, but makes
communication and data sharing more complex.

Python Example
Python’s multiprocessing module can be used to create processes that
share memory or exchange messages.

Code Listing 3: Python code for Shared Memory

import multiprocessing

def square(numbers, result, index):
    for idx, num in enumerate(numbers):
        # result is a shared-memory array visible to both processes.
        result[idx] = num * num
        index.put(idx)

if __name__ == "__main__":
    numbers = [2, 3, 5]
    result = multiprocessing.Array('i', 3)   # shared integer array
    index = multiprocessing.Queue()

    p = multiprocessing.Process(target=square, args=(numbers, result, index))

    p.start()
    p.join()

    while not index.empty():
        print(result[index.get()])

Code Listing 4: Python code for Distributed Memory

import multiprocessing

def square(numbers, queue):
    for num in numbers:
        # No memory is shared; results are sent back as messages.
        queue.put(num * num)

if __name__ == "__main__":
    numbers = [2, 3, 5]
    queue = multiprocessing.Queue()

    p = multiprocessing.Process(target=square, args=(numbers, queue))

    p.start()
    p.join()

    while not queue.empty():
        print(queue.get())

Assignment Problem
Construct a Python script that computes the total of squares of a list
of numbers using both shared memory and distributed memory tech-
niques. Assess the effectiveness of both strategies.

Symmetric Multiprocessing (SMP)


Symmetric multiprocessing is a type of parallel processing architec-
ture in which multiple processors share the same memory and are
managed by a single operating system. Each processor is able to in-
dependently execute any process, and processes can be divided or
distributed across processors as needed.
Characteristics and Usage:

Figure 2: Shared memory computer.

• Uniform Memory Access (UMA): All processors have equal access to main memory, making the memory access time uniform.

• Scalability: While SMP systems are scalable, there is a practical limit to the number of processors due to memory access contention.

• Examples: Many traditional multicore servers and workstations.

Distributed Memory Processing (DMP)


Distributed Memory Processing (DMP) is a type of computing which involves
multiple processors, each with its own local memory. These processors
communicate with one another through a communication network,
exchanging messages or data. DMP is employed in a variety of ap-
plications, including parallel computing, distributed databases, and
distributed artificial intelligence.
Characteristics and Usage:

Figure 3: Distributed memory computer.

• Non-Uniform Memory Access (NUMA): Since each processor accesses its local memory faster than remote memory, memory access times are non-uniform.

• Scalability: DMP architectures can be highly scalable, since adding more processors also means adding more memory, without increasing the contention for any single memory.

• Examples: Many supercomputers and high-performance computing clusters.

Hybrid Architectures
Hybrid architectures combine the characteristics of both Symmetric
Multiprocessing (SMP) and Distributed Memory Processing (DMP). These
systems may contain multiple processors (or multicore processors) that
share memory in an SMP fashion, while also communicating with
other similar sets through message-passing in a DMP manner.
Characteristics and Usage:

• Mixed Memory Access: Features both UMA and NUMA characteristics depending on the data location.

• Flexibility: Hybrid architectures offer flexibility by leveraging the strengths of both SMP and DMP. They can handle a variety of parallel workloads efficiently.

• Examples: Modern supercomputers like those used in large-scale simulations or complex scientific computations often use hybrid architectures to achieve maximum performance and scalability.


Figure 4: Hybrid memory computer.

GPU Accelerators
A Graphics Processing Unit (GPU) accelerator is a type of specialised
electronic circuit created to speed up the processing of visuals and
videos for display. Over time, GPUs have advanced beyond graphics
applications and are now employed to speed up general-purpose sci-
entific and engineering computations.

Characteristics and Usage:

• Massively Parallel: GPUs are designed for tasks that can be bro-
ken down and processed simultaneously. They contain hundreds
to thousands of smaller cores designed for multithreaded, parallel
performance.

• High throughput: While individual cores might be slower than typical CPU cores, the sheer number of them allows for impressive data processing rates, especially in tasks well-suited for parallelism such as matrix operations or simulations.

• Specialised Memory Architecture: GPUs typically have high-bandwidth memory, which is crucial for tasks involving large datasets, like deep learning.

• General-Purpose GPU (GPGPU) Computing: With the advent of platforms like NVIDIA’s CUDA and OpenCL, GPUs can be used for tasks outside of graphics processing. This is termed GPGPU (General-Purpose Computing on Graphics Processing Units); a brief sketch follows this list.

• Examples: GPU-accelerated databases, deep learning training and inference, molecular dynamics simulation, financial modelling, and, of course, 3D gaming and graphics rendering.
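As a minimal sketch of GPGPU-style computing, the snippet below offloads a matrix multiplication to a GPU using the CuPy library. The assumption, purely for illustration, is that CuPy and a CUDA-capable GPU are available; CUDA C/C++, OpenCL, or other libraries could be used instead.

import cupy as cp

# Allocate two large matrices directly in GPU memory.
a = cp.random.rand(4096, 4096)
b = cp.random.rand(4096, 4096)

# The multiplication is executed as a massively parallel kernel on the GPU.
c = a @ b
cp.cuda.Device().synchronize()   # wait for the GPU kernel to finish

print(float(c[0, 0]))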


Benefits in Parallel Computing:

• Efficient Data Parallelism: For tasks where the same operation is performed on different pieces of data, GPUs excel. This makes them particularly effective for applications like neural networks, where the same operation is performed on each node of a large network.

• Memory Hierarchy: GPUs have a unique memory hierarchy that optimises for different access patterns, which can be exploited for various algorithms.

• Energy Efficiency: Per computation, GPUs can often be more power-efficient than traditional CPUs, making them an attractive choice for large-scale, power-hungry computations.

Challenges:

• Programming Complexity: Writing efficient GPU code can be challenging. It requires a deep understanding of the architecture and memory hierarchy.

• Not Always the Best Tool: Although powerful, GPUs are not always
the best tool for the job. Tasks that are not inherently parallelizable
may not see significant speedup on a GPU.

Graphics Processing Units (GPUs) have had a major impact on par-
allel computing, particularly in areas where data parallelism can be
utilised. As they continue to progress and become integrated with
other computing systems, the potential of heterogeneous computing
systems appears to be very encouraging.

Answer True/False with justification.

• In parallel computing, adding more processors to a task will always result in a proportionate decrease in the time required to complete the task.

• SIMD is beneficial when the same set of instructions needs to be performed on different data elements simultaneously.

• An advantage of Distributed Memory Processing (DMP) is that all processors can access memory at uniform speeds, regardless of where the data are stored.

• In Symmetric Multiprocessing (SMP), a task can be split across multiple processors, but all processors access a shared memory.

• GPGPU computing signifies that GPUs have transitioned from being solely graphical units to also handling general computational tasks.


3.4 Cache Optimisation & Techniques


Cache optimisation is essential for improving computer system per-
formance. Since cache memory is faster but has less capacity than
main memory, it is important to ensure that frequently accessed data
is stored in the cache. Various cache optimisation techniques can be
employed, such as cache blocking, loop tiling, and cache-conscious
data layout. Cache blocking involves breaking up a larger computa-
tion into smaller blocks that can fit into the cache, thus reducing the
number of cache misses. Loop tiling is a technique where nested loops
are divided into smaller, fixed-sized blocks, or tiles, that can fit into
the cache. Cache-conscious data layout involves organising data in
memory in a way that maximises cache utilisation.
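As a minimal sketch of cache blocking, the code below computes a blocked matrix multiplication in NumPy with an illustrative tile size of 64; the point is the structure of the loops, since NumPy’s own matrix multiply is already heavily optimised and the benefit shows up mainly in compiled code.

import numpy as np

n, b = 512, 64                 # matrix size and tile size (illustrative)
A = np.random.rand(n, n)
B = np.random.rand(n, n)
C = np.zeros((n, n))

for i in range(0, n, b):
    for j in range(0, n, b):
        for k in range(0, n, b):
            # Each update touches only three b-by-b tiles, which are small
            # enough to stay in cache while they are reused.
            C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]

print(np.allclose(C, A @ B))   # sanity check against the library routine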

Python Example
Python does not provide direct control over the cache, but you can
optimise your code for better cache performance by using efficient al-
gorithms and data structures.

Code Listing 5: Python code for cache-conscious data access

# Example of cache-conscious data layout
import numpy as np

# Create a large 2D array
array = np.random.rand(1000, 1000)

# Access the array in a cache-friendly (row-major) way
total = 0
for i in range(1000):
    for j in range(1000):
        total += array[i][j]
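To see the effect of access order, the sketch below times the same summation in row-major and column-major order. In pure Python the interpreter overhead hides much of the difference, so treat the comparison as illustrative; the gap is far larger in compiled code or with whole-array NumPy operations.

import time
import numpy as np

array = np.random.rand(1000, 1000)

def traverse(row_major):
    # Row-major traversal walks memory contiguously; column-major jumps
    # a full row ahead on every access, reusing cache lines poorly.
    total = 0.0
    for i in range(1000):
        for j in range(1000):
            total += array[i][j] if row_major else array[j][i]
    return total

start = time.perf_counter()
traverse(True)
row_time = time.perf_counter() - start

start = time.perf_counter()
traverse(False)
col_time = time.perf_counter() - start

print(f"row-major: {row_time:.3f} s, column-major: {col_time:.3f} s")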

3.5 Application to Machine Learning: Data loading and pre-processing


Machine learning applications often involve processing large amounts
of data, which can be a bottleneck in terms of both computation and
memory. Parallel computing can help accelerate data loading and pre-
processing by distributing work across multiple processors or nodes.
For example, data can be divided into smaller batches that can be
loaded and processed in parallel. Additionally, memory hierarchies
and cache optimisation techniques can be employed to ensure that
data are accessed efficiently during pre-processing.


Python Example
Using the Python multiprocessing module, you can parallelise the data
loading and preprocessing steps.

Code Listing 6: Python code for parallel data loading and preprocessing

import multiprocessing
import numpy as np

def load_and_preprocess_data(filename, queue):
    data = np.loadtxt(filename)
    # Preprocess the data (scale values to the range [0, 1])
    data = data / 255.0
    queue.put(data)

if __name__ == "__main__":
    filenames = ['data1.txt', 'data2.txt', 'data3.txt']
    queue = multiprocessing.Queue()

    processes = [multiprocessing.Process(target=load_and_preprocess_data,
                                         args=(filename, queue))
                 for filename in filenames]

    for p in processes:
        p.start()

    for p in processes:
        p.join()

    # Note: for very large arrays, drain the queue before joining to
    # avoid the workers blocking on a full queue.
    while not queue.empty():
        data = queue.get()
        print(data)

Assignment Problem
Construct a Python script that loads and preprocesses a collection of
images in parallel. Measure the time taken to load and preprocess the
images and compare it to the sequential approach.

Summary
In this section, we have explored the fundamentals of parallel comput-
ing and memory hierarchies and their application to machine learn-
ing. We began by summarising computer architecture and then dived
into the specifics of memory hierarchies, including registers, cache,
main memory, and secondary memory. We then compared shared and
distributed memory systems and discussed various cache optimisa-
tion techniques. Additionally, we looked at how these concepts can be
applied to machine learning, particularly in optimising data loading
and preprocessing. The Python examples and assignment problems
provided throughout this section are intended to help readers under-
stand the concepts discussed and motivate them to use the knowledge
gained to improve performance and efficiency in machine learning ap-
plications. Understanding these principles is essential for developing
and optimising machine learning applications, resulting in consider-
able improvements in performance and efficiency.

This chapter provided a comprehensive overview of computer ar-
chitecture, intertwining the concepts of growth, historical evo-
lution, computational laws, hardware advancements, and par-
allel computing. In-depth exploration of these elements yields
a cohesive understanding of the multifaceted nature of comput-
ing, painting a holistic picture of how each component, innova-
tion, and concept collectively contribute to the current and future
landscape of Computational and Data Science.


4 Mastering Shared Memory, GPU Computing, and Advanced Implementations


5 Distributed Dynamics: Exploring MPI and Parallel Implementations in ML


6 Scalable Frontiers: Navigating Parallel Libraries and Algorithms in ML
