AI_Mastery_AIML_Handbook
Sashikumaar Ganesan
Department of Computational and Data Sciences
Indian Institute of Science, Bangalore
September 2023
Accelerating Intelligent Learning Through Advanced Parallel Computing
ZenteiQ
Sashikumaar Ganesan ai mastery: the scalable ai & ml handbook
Contents
1 Introduction
2 The Evolution of Computer Architecture
2.1 History of Computers
2.2 Measuring the Growth and Speed of Computers
2.3 Moore’s Law and Dennard Scaling
2.4 Timeline of Hardware Development
2.5 Changing Pace of Hardware Development
2.6 Class of Computers
2.7 Birth of Parallel Computing
3 Parallel Computing and Memory Hierarchies
3.1 Overview of Computer Architecture
3.2 Memory Hierarchies
3.3 Shared vs Distributed Memory
3.4 Cache Optimisation & Techniques
3.5 Application to Machine Learning: Data Loading and Pre-processing
4 Mastering Shared Memory, GPU Computing, and Advanced Implementations
5 Distributed Dynamics: Exploring MPI and Parallel Implementations in ML
6 Scalable Frontiers: Navigating Parallel Libraries and Algorithms in ML
Copyright Policy
Introduction
This copyright policy applies to the digital textbook (the "Textbook")
created and distributed by Zenteiq Edtech Private Limited ("Zenteiq,"
"we", "our," or "us"). The purpose of this policy is to inform users of
the restrictions and allowances made under the copyright law and the
rights reserved by Zenteiq Edtech Private Limited.
Copyright Ownership
All rights, including copyright and intellectual property rights in the
Textbook, are owned by Zenteiq Edtech Private Limited. All rights
reserved.
Usage Rights
The Textbook is provided for personal, non-commercial use only. Users
are permitted to download and use the Textbook for educational purposes. Any reproduction, redistribution, or modification of the Textbook, or any part of it, other than as expressly permitted by this policy, is strictly prohibited.
Prohibited Uses
Unless expressly permitted by us in writing, users may not:
Permissions
Any person wishing to use the Textbook in a manner not expressly
permitted by this policy should contact Zenteiq Edtech Private Limited
to request permission. Permissions are granted at the sole discretion
of Zenteiq Edtech Private Limited.
Legal Action
Infringement of the rights in relation to the Textbook will lead to legal
action, which may include claims for damages, injunctions and / or
recovery of legal costs.
Amendments
Zenteiq Edtech Private Limited reserves the right to amend this copy-
right policy at any time without notice. The revised policy will be ef-
fective immediately upon posting on the relevant platform, and users
are deemed to be aware of and bound by any changes to the copyright
policy upon publication.
Contact Information
Any enquiries regarding this copyright policy, requests for permission
to use the Textbook, or any related questions should be directed to:
Notice
This Textbook is protected by copyright law and international treaties.
Unauthorised reproduction or distribution of this textbook, or any por-
tion of it, may result in severe civil and criminal penalties, and will be
prosecuted to the maximum extent possible under the law.
1 Introduction
Why HPC?
How HPC?
Summary
The history of computers is a story of human ingenuity and innovation, stretching from ancient mechanical devices to modern electronic computers.
Clock Speed
Clock speed, measured in hertz (Hz), is one of the most widely used
metrics to measure computer performance. It refers to the frequency at
which the central processing unit (CPU) of a computer operates. The
higher the clock speed, the more operations a CPU can perform in a
second. However, it is essential to note that a higher clock speed does
not always translate to better overall performance, as it is just one of
many factors that contribute to a computer’s performance. Other fac-
tors, such as CPU architecture, cache size, and instruction set, also play
a crucial role in determining the overall performance of a computer.
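To make concrete the point that clock speed is only one factor, the sketch below combines clock rate with core count and per-cycle throughput in a back-of-the-envelope peak estimate. All figures are illustrative assumptions, not the specifications of any particular CPU.

```python
# Back-of-the-envelope peak performance estimate.
# All figures below are illustrative assumptions, not real CPU specs.
clock_hz = 3.0e9          # 3 GHz clock
cores = 8                 # number of CPU cores (assumed)
flops_per_cycle = 16      # e.g. wide SIMD units (assumed)

peak_flops = clock_hz * cores * flops_per_cycle
print(f"Theoretical peak: {peak_flops / 1e9:.0f} GFLOPS")
```

Doubling the clock doubles this estimate, but so does doubling the core count or the per-cycle throughput, which is why two CPUs with the same clock speed can differ widely in performance.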
Benchmarking
Benchmarking is the process of comparing the performance of a com-
puter or a specific component (such as a processor or a graphics card)
with that of other computers or components. It involves running a
set of standardised tests or applications, known as benchmarks, that
are designed to simulate real-world workloads and measure various
aspects of performance. Benchmarks can be classified into three main
categories: synthetic benchmarks, application benchmarks, and real-
world benchmarks. Synthetic benchmarks are designed to test specific
aspects of performance, such as processing speed or memory band-
width, in isolation. Application benchmarks, on the other hand, in-
volve running actual applications or parts of applications to measure
performance in real-world scenarios. Finally, real-world benchmarks
involve performing tasks that are representative of actual workloads
encountered by users, such as browsing the web, editing video, or playing games.
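A synthetic benchmark can be as simple as timing a fixed workload. The sketch below times a matrix multiplication with NumPy; the matrix size is an arbitrary illustrative choice, and a real benchmark suite would repeat the measurement many times and report statistics.

```python
import time
import numpy as np

# A tiny synthetic benchmark: time one fixed workload (matrix multiply).
# The 500x500 size is an arbitrary illustrative choice.
a = np.random.rand(500, 500)
b = np.random.rand(500, 500)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start
print(f"500x500 matrix multiply: {elapsed * 1e3:.2f} ms")
```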
Summary
Measuring the growth and speed of computers is a complex task that
involves considering various factors and metrics. Clock speed, IPS,
and FLOPS are some of the most commonly used metrics, but it is im-
portant to consider other factors, such as CPU architecture, cache size,
and instruction set, to get a comprehensive view of a computer’s per-
formance. Additionally, benchmarking is a valuable tool for compar-
ing the performance of different computers and components in real-
world scenarios. Finally, it is essential to consider not only raw per-
formance but also power consumption and thermal efficiency when
evaluating the performance of a computer.
In summary, measuring the growth and speed of computers re-
quires a comprehensive approach that considers multiple factors and
metrics. By doing so, we can better understand the capabilities and
limitations of computers and make more informed decisions about
their design and operation.
Review the following statement and decide whether it is true or false:
• All metrics for measuring a computer’s speed and performance are universally applicable to every computational task.
Moore’s Law
Moore’s Law is an observation made by Gordon Moore, cofounder
of Intel, in 1965. He observed that the number of transistors on a
microchip doubled approximately every two years, while the cost of
computers was halved. In its original form, Moore’s law stated that
the transistor density of semiconductor devices would double approx-
imately every year. However, in 1975, Moore revised the law to reflect
a doubling approximately every two years.
This observation became known as Moore’s law and has been in-
credibly accurate in predicting the growth of computing power over
the past several decades. It has been a driving force behind the ex-
ponential increase in computing power, leading to smaller, faster, and
cheaper electronic devices.
However, Moore’s law is not a physical or natural law, but rather
an observation and a projection of a historical trend. In recent years,
the pace of advancement predicted by Moore’s law has started to slow
down due to physical limitations, such as heat dissipation and quan-
tum effects, encountered as transistors approach the nanometre scale.
Despite these challenges, the semiconductor industry has continued
to innovate and find ways to increase computing power through other
means, such as multicore processors and 3D chip stacking.
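As a quick illustration of the doubling rule, the projection below starts from the commonly cited 1971 Intel 4004 figure of roughly 2,300 transistors. The starting point and doubling period are used only for illustration, not as a fit to real chip data.

```python
# Project transistor counts under Moore's law (doubling every two years).
# The 1971 Intel 4004 figure (~2300 transistors) is a commonly cited
# historical starting point, used here purely for illustration.
START_YEAR, START_COUNT = 1971, 2300

def projected_transistors(year, doubling_period=2):
    doublings = (year - START_YEAR) / doubling_period
    return START_COUNT * 2 ** doublings

for year in (1971, 1981, 1991, 2001):
    print(year, round(projected_transistors(year)))
```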
Dennard Scaling
Dennard Scaling, named after Robert H. Dennard, is another key prin-
ciple that has driven the advancement of microelectronics. In 1974,
Dennard and his colleagues observed that as transistors were made
smaller, their power density remained constant, meaning that the power
consumption per unit area of a chip remained the same as the size of
the transistors decreased. This was because the voltage and current
were scaled down with the dimensions of the transistor.
Dennard scaling enabled the creation of smaller and more power-
efficient microprocessors, which contributed significantly to the de-
velopment of portable electronic devices such as laptops and smart-
phones.
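The scaling argument can be checked with a line of arithmetic. Under ideal Dennard scaling, shrinking linear dimensions by a factor k (and scaling voltage down with them) cuts both per-transistor power and area by k², leaving power density unchanged. A minimal sketch:

```python
# Ideal Dennard scaling: shrink linear dimensions by a factor k while
# scaling voltage down with them. Dynamic power ~ C * V^2 * f, and with
# C -> C/k, V -> V/k, f -> k*f, per-transistor power falls as 1/k^2.
# Transistor area also falls as 1/k^2, so power density stays constant.
def dennard_scale(power, area, k):
    return power / k**2, area / k**2

p0, a0 = 1.0, 1.0                      # normalised baseline transistor
p1, a1 = dennard_scale(p0, a0, k=2.0)  # one shrink generation
print("density before:", p0 / a0, "after:", p1 / a1)
```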
Summary
Moore’s Law and Dennard Scaling have been fundamental principles
that have guided the development of microelectronics for many decades.
However, as the size of transistors approaches the atomic scale, the
challenges associated with heat dissipation, quantum effects, and leak-
age current have led to a slowdown in the rate of improvement pre-
dicted by these principles. As a result, the semiconductor industry
is exploring alternative approaches to continue the advancement of
computing power, which will be crucial for the development of future
electronic devices and applications.
Review the following statements and decide whether each is true or false:
• Dennard Scaling deals with the computational power of chips, while Moore’s law addresses their power consumption.
• The continuation of Moore’s law would suggest that computers will indefinitely become faster while consuming less power.
• The slowing down of Moore’s law means that hardware innovations have completely stagnated.
1945: Mainframes
Mainframe computers, built first from vacuum tubes and later from discrete transistors, were the first large-scale electronic digital computers. They were used primarily by
large organisations for critical applications, bulk data processing, and
complex calculations.
1970: Minicomputers
Minicomputers, built first from discrete components and later from low-integration chips, were smaller and less expensive than mainframes. They were
designed to perform specific tasks and used in applications such as
industrial automation and scientific research.
Summary
The timeline of hardware development highlights the incredible progress
that has been made in the field of computing over the past several
decades. From the early days of mainframes to the rise of special-
purpose architectures, the evolution of computer hardware has been
marked by continuous innovation and adaptation to new challenges.
As we look to the future, it is clear that the pace of hardware develop-
ment will continue to be influenced by a combination of technological
advancements and architectural innovations.
Review the following statements and decide whether each is true or false:
• The development of silicon chips was pivotal because they allowed more transistors to be packed into a smaller space, increasing computational power.
• The introduction of SSDs over HDDs was purely an aesthetic choice, with no impact on system performance.
• GPUs, due to their architecture, are inherently better at tasks that require parallel processing compared to traditional CPUs.
• Quantum computers, when fully realised, will likely replace all traditional computers because they are superior in every computational aspect.
IoT/Embedded Computers
IoT (Internet of Things) refers to the network of physical devices em-
bedded with sensors, software, and other technologies to connect and
exchange data with other systems over the Internet. Embedded com-
puters are specialised systems designed to perform dedicated tasks or
functions within a larger system. They are typically optimised for spe-
cific applications and do not require user intervention.
• Compact and Energy Efficient: Due to the nature of their tasks, em-
bedded systems are designed to be compact and consume minimal
power.
Desktop Computers
Desktop computers are designed to be used in one place and include
a monitor, keyboard, mouse, and the main processing unit. These machines are intended for regular, general-purpose use.
Servers/Workstations
Servers are powerful computers designed to process requests and send
data to other computers over a local network or the Internet. Work-
stations are high-end computers designed for technical or scientific
applications.
• Reliability: Built to run 24/7, with features such as error-checking memory and redundant power supplies.
Summary
The variety of available computers has opened up a wide range of possibilities, from individual use to worldwide operations. As technology continues to advance, the range of computer classes and their capabilities will only grow.
Cache Hierarchy
Dual-core system
Summary
Parallel computing has played a key role in the advancement of com-
puting technology and has enabled us to solve complex problems more
quickly and efficiently. With the continued growth of data and the de-
mand for real-time processing, the importance of parallel computing
will only increase. Despite the challenges associated with parallel com-
puting, ongoing research and development in this area is expected to
lead to new and improved parallel computing solutions that will shape
the future of computing.
Python Example:
In Python, the multiprocessing module allows you to create processes
that can run concurrently. Each process runs in its own Python inter-
preter.
import multiprocessing

def square(n):
    print("The square of", n, "is", n * n)

if __name__ == "__main__":
    p1 = multiprocessing.Process(target=square, args=(3,))
    p2 = multiprocessing.Process(target=square, args=(4,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
Assignment Problem:
Construct a Python script that uses the multiprocessing module to
compute the factorial of two numbers simultaneously.
Python Example
Python does not give you direct access to the memory hierarchy. How-
ever, you can use the numpy library to create and manipulate large
arrays, which will, in turn, involve cache optimisation and memory
hierarchy.
import numpy as np

# Create a large matrix
array = np.random.rand(1000, 1000)

# Perform matrix multiplication
result = np.dot(array, array)
Assignment Problem
Create a Python program that generates two large matrices and per-
forms matrix multiplication. Measure the time taken to perform the
multiplication and try to optimise the code to reduce the computation
time.
Python Example
Python’s multiprocessing module can be used to create processes that
share memory or exchange messages.
import multiprocessing

def increment(counter):
    # Shared memory: both processes see the same Value object
    with counter.get_lock():
        counter.value += 1

def produce(queue):
    # Message passing: data is sent through a Queue
    queue.put("hello")

if __name__ == "__main__":
    # Shared-memory variant
    counter = multiprocessing.Value("i", 0)
    p = multiprocessing.Process(target=increment, args=(counter,))
    p.start()
    p.join()
    print("counter:", counter.value)

    # Message-passing variant
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=produce, args=(queue,))
    p.start()
    print("received:", queue.get())
    p.join()
Assignment Problem
Construct a Python script that computes the sum of squares of a list of numbers using both shared-memory and distributed-memory techniques. Compare the performance of the two approaches.
Hybrid Architectures
Hybrid architectures combine the characteristics of both Symmetric
Multi-Processing (SMP) and Distributed Multi-Processing (DMP). These
systems may contain multiple processors (or multicore processors) that
share memory in an SMP fashion, while also communicating with
other similar sets through message-passing in a DMP manner.
Characteristics and Usage:
GPU Accelerators
A Graphics Processing Unit (GPU) accelerator is a type of specialised
electronic circuit created to speed up the processing of visuals and
videos for display. Over time, GPUs have advanced beyond graphics
applications and are now employed to speed up general-purpose sci-
entific and engineering computations.
• Massively Parallel: GPUs are designed for tasks that can be bro-
ken down and processed simultaneously. They contain hundreds
to thousands of smaller cores designed for multithreaded, parallel
performance.
Challenges:
• Not Always the Best Tool: Although powerful, GPUs are not always
the best tool for the job. Tasks that are not inherently parallelizable
may not see significant speedup on a GPU.
• GPGPU computing signifies that GPUs have transitioned from being solely
graphical units to also handling general computational tasks.
Python Example
Python does not provide direct control over the cache, but you can
optimise your code for better cache performance by using efficient al-
gorithms and data structures.
import numpy as np

# Create a large 2D array
array = np.random.rand(1000, 1000)

# Access the array in a cache-friendly (row-major) order
total = 0.0
for i in range(1000):
    for j in range(1000):
        total += array[i, j]
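To observe the effect of traversal order, the sketch below times the same reduction in row-major and column-major order over a C-ordered NumPy array. In pure Python the interpreter overhead can mask much of the gap, so treat this as a qualitative illustration rather than a rigorous measurement.

```python
import time
import numpy as np

array = np.random.rand(1000, 1000)  # C-order (row-major) by default

def traverse(a, row_major=True):
    # Row-major order touches consecutive memory addresses; column-major
    # order jumps a full row's worth of bytes on every access.
    total = 0.0
    n, m = a.shape
    if row_major:
        for i in range(n):
            for j in range(m):
                total += a[i, j]
    else:
        for j in range(m):
            for i in range(n):
                total += a[i, j]
    return total

for order, label in ((True, "row-major"), (False, "column-major")):
    start = time.perf_counter()
    traverse(array, row_major=order)
    print(f"{label}: {time.perf_counter() - start:.3f} s")
```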
Python Example
Using the Python multiprocessing module, you can parallelise the data
loading and preprocessing steps.
import multiprocessing

processes = [multiprocessing.Process(target=load_and_preprocess_data,
                                     args=(filename, queue))
             for filename in filenames]
for p in processes:
    p.start()
for p in processes:
    p.join()
Assignment Problem
Construct a Python script that loads and preprocesses a collection of images in parallel. Measure the time taken to load and preprocess the images and compare it with the sequential approach.
Summary
In this section, we explored the fundamentals of parallel computing and memory hierarchies and their application to machine learning. We began by summarising computer architecture and then dove into the specifics of memory hierarchies, including registers, cache, main memory, and secondary memory. We then compared shared and distributed memory, examined cache optimisation techniques, and saw how parallel data loading and pre-processing speed up machine-learning pipelines.