Scaler Numpy Notes

The document provides a comprehensive guide on using Numpy, covering installation, basic array creation, and operations such as indexing, slicing, and universal functions. It includes practical use cases like calculating the Net Promoter Score (NPS) and image manipulation, highlighting the advantages of Numpy over Python lists in terms of performance and memory efficiency. Additionally, it explains the underlying mechanics of Numpy arrays and their applications in data analysis and numerical computations.
Copyright
© All Rights Reserved

Content

• Installing and Importing Numpy

• Introduction to use case

• Motivation: Why to use Numpy? - How is it different from Python Lists?

• Creating a Basic Numpy Array


– From a List - array(), shape, ndim
– From a range and stepsize - arange()
– type() ndarray
• How numpy works under the hood?

• Indexing and Slicing on 1D


– Indexing
– Slicing
– Masking (Fancy Indexing)
• Operation on array

• Universal Functions (ufunc) on 1D array


– Aggregate Function/ Reduction functions - sum(), mean(), min(), max()
• Usecase: calculate NPS
– loading data: np.loadtxt()
– np.empty()
– np.unique()
• Reshape with -ve index

• Matrix Multiplication
– matmul(), @, dot()
• Vectorization
– np.vectorize()
• 3D arrays

• Use Case: Image Manipulation using Numpy


– Opening an Image
– Details of an image
– Visualizing Channels
– Rotating an Image (Transposing a Numpy Array)
– Trim image
– Saving ndarray as Image
• 2-D arrays (Matrices)
– reshape()
– 2 Questions
– Transpose
– Converting Matrix back to Vector - flatten()
• Indexing and Slicing on 2D

– Indexing

– Slicing

– Masking (Fancy Indexing)

• Universal Functions (ufunc) on 2D

– Aggregate Function/ Reduction functions - sum(), mean(), min(), max()

– Axis argument

– Logical Operations

– Sorting function - sort(), argsort()

• Use Case: Fitness Data analysis


– Loading data set and EDA using numpy
– np.argmax()
• Array splitting and Merging
– Splitting arrays - split(), hsplit(), vsplit()
– Merging Arrays - hstack(), vstack(), concatenate()
• Broadcasting
– np.tile()
• Dimension Expansion and Reduction
– np.expand_dims()
– np.newaxis
– np.squeeze()
• Shallow vs Deep Copy
– view()
– copy()
– copy.deepcopy()

Installation Using %pip


!pip install numpy
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: numpy in /usr/local/lib/python3.9/dist-packages (1.22.4)

Importing Numpy
• We'll import numpy under the alias np for ease of typing
import numpy as np

Use Case: NPS (Net Promoter Score)


Imagine you are a Data Analyst @ Airbnb
You've been asked to analyze user survey data and report NPS to the management

But, what exactly is NPS?

Have you seen something like this ?


Link: https://drive.google.com/file/d/1-u8e-v_90JdikorKsKzBM-JJqoRtzsN8/view?usp=sharing

This is called a Likelihood to Recommend (LTR) survey

• Responses are given a scale ranging from 0–10,


– with 0 labeled with “Not at all likely,” and
– 10 labeled with “Extremely likely.”

Based on this, we calculate the Net Promoter score

How to calculate NPS score?

We label our responses into 3 categories:

• Detractors: Respondents with a score of 0-6


• Passive: Respondents with a score of 7-8
• Promoters: score of 9-10.

And

Net Promoter score = % Promoters - % Detractors.


How is NPS helpful?

Why would we want to analyse the survey data for NPS?


NPS helps a brand in gauging its brand value and sentiment in the market.

• Promoters are highly likely to recommend your product or service, bringing in more business,
• whereas Detractors are likely to recommend against your product or service, bringing business down.

These insights can help a business make customer-oriented decisions and improve its product.

Two-thirds of Fortune 500 companies use NPS

Let's first look at the data we have gathered

Dataset: https://drive.google.com/file/d/1c0ClC8SrPwJq5rrkyMKyPn80nyHcFikK/view?usp=sharing

Notice that the file contains the scores from the Likelihood to Recommend survey
Using NumPy

• we will bin our data into promoters/detractors
• calculate the percentage of promoters/detractors
• calculate NPS

Why use Numpy?


Suppose you are given a list of numbers and you have to find the square of each number and store it back in the original list.

a = [1,2,3,4,5]

Solution: the basic approach is to iterate over the list and square each element

a = [i**2 for i in a]
print(a)

[1, 4, 9, 16, 25]

Let's try the same operation with NumPy


a = np.array([1,2,3,4,5])
print(a**2)

[ 1 4 9 16 25]
The biggest benefit of NumPy is that it supports element-wise operations

Notice how easy and clean the syntax is.

But is clean, easy-to-write syntax the only benefit we get here?
• To understand this, let's time these operations
• We will use %timeit to measure the time taken by each operation
l = range(1000000)

%timeit [i**2 for i in l]

546 ms ± 164 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It took roughly 546 ms per loop to iterate over and square all elements from 0 to 999,999

Let's perform the same operation using numpy arrays

• We will use the np.array() function for this.
• np.array() converts a Python sequence (here, a range) into a numpy array.
• We can then perform element-wise operations on it
l = np.array(range(1000000))

%timeit l**2

797 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Notice the per-loop time for the numpy operation: about 797 µs, several hundred times faster than the list comprehension

What is the major reason behind numpy's faster computation?

• A numpy array is densely packed in memory because all of its elements share one (homogeneous) type.
• The element-wise loop runs in compiled code, which can process several values per instruction (SIMD) instead of interpreting Python bytecode for every element.
• Numpy functions are implemented in C, which again makes them faster than equivalent loops over Python lists.

What is the takeaway from this exercise?


• NumPy provides clean syntax for element-wise operations
• The per-loop time for numpy is much lower than for a list
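Outside a notebook, where the %timeit magic is not available, the same comparison can be sketched with the standard-library timeit module. This is only an illustration: the array size and the number of runs are arbitrary choices, and the absolute timings will differ from machine to machine.

```python
import timeit
import numpy as np

n = 100_000                               # arbitrary size for the illustration
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)     # explicit int64 so the squares cannot overflow

# Time 10 runs of each approach
list_time = timeit.timeit(lambda: [i**2 for i in py_list], number=10)
numpy_time = timeit.timeit(lambda: np_arr**2, number=10)

print(f"list comprehension: {list_time:.4f} s for 10 runs")
print(f"numpy array:        {numpy_time:.4f} s for 10 runs")

# Both approaches produce the same values
same = np.array_equal(np_arr**2, np.array([i**2 for i in py_list], dtype=np.int64))
print(same)
```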

In fact, Numpy is one of the most important packages for performing numerical computations

Why?

Most computations in DS/ML/DA can be broken down into element-wise operations

Let's create some basic arrays in NumPy


First method we'll see in Numpy is array()
• We pass a Python list into np.array()
• It converts that Python list into a numpy array

# Let's create a 1-D array


arr1 = np.array([1, 2, 3])
print(arr1)
print(arr1 * 2)

[1 2 3]
[2 4 6]

• This is NOT a normal Python list


• It's a numpy array - supports element-wise operation

Question: What will be the dimension of this array?


1, because it is a 1D array.

We can get the number of dimensions of an array using the ndim property

arr1.ndim

1

Numpy arrays have another property called shape, which tells us the number of elements along every dimension.

arr1.shape

(3,)

Let's take another example to understand shape and ndim better

arr2 = np.array([[1, 2, 3], [4, 5, 6], [10, 11, 12]])


print(arr2)

[[ 1 2 3]
[ 4 5 6]
[10 11 12]]

What do you think will be the dimension of this 2D array?


arr2.ndim

2

And what about the shape?


arr2.shape
(3, 3)

Let's create some sequences in Numpy


From a range and stepsize - arange()
• np.arange()

• Similar to range()

• We can pass starting point, ending point (not included in array) and step-size

• arange(start, end, step)

arr2 = np.arange(1, 5)
arr2

array([1, 2, 3, 4])

arr2_stepsize = np.arange(1, 5, 2)
arr2_stepsize

array([1, 3])

• np.arange() behaves in same way as range() function

But then why not call it np.range?


• In np.arange(), we can pass a floating point number as step-size
arr3 = np.arange(1, 5, 0.5)
arr3

array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
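One caveat worth keeping in mind: with a floating-point step, rounding can make the length of an np.arange() result hard to predict. When the number of points matters more than the step, np.linspace() is a common alternative. The sketch below (variable names are illustrative) builds the same values both ways.

```python
import numpy as np

by_step = np.arange(1, 5, 0.5)        # start, stop (excluded), step
by_count = np.linspace(1, 4.5, 8)     # start, stop (included), number of points

print(by_step)
print(by_count)
print(np.allclose(by_step, by_count))  # the two arrays hold the same values
```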

Let's check the type of a Numpy array


type(arr1)

numpy.ndarray

But why are we calling it an array? Why not a NumPy list?

How numpy works under the hood?


• It's a Python library: we write Python code to use numpy

However, numpy itself is written in C


This allows numpy to manage memory very efficiently
But why are C arrays more efficient or faster than Python lists?

• In Python List, we can store objects of different types together - int, float, string,
etc.

• The actual values of objects are stored somewhere else in the memory

• Only References to those objects (R1, R2, R3, ...) are stored in the Python List.

• So, when we have to access an element in Python List, we first access the
reference to that element and then that reference allows us to access the value of
element stored in memory

C array does all this in one step


• C array stores objects of same data type together

• Actual values are stored in same contiguous memory


• So, when we have to access an element in C array, we access it directly using
indices.

BUT, notice that this makes a NumPy array lose the flexibility to store heterogeneous data
==> Unlike Python lists, a NumPy array can only hold homogeneous data (all elements of one type)

• So numpy arrays are NOT really Python lists


• They are basically C arrays
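The memory layout described above can be inspected directly. The following is a minimal sketch, assuming an explicit int64 dtype so that the element size is predictable; the variable names are illustrative only.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)   # explicit dtype: 8 bytes per element

# The list object holds n references; the int objects themselves live elsewhere
list_container_bytes = sys.getsizeof(py_list)

# The array holds the raw values back to back in one block
print(np_arr.itemsize)                  # bytes per element
print(np_arr.nbytes)                    # itemsize * number of elements
print(np_arr.flags['C_CONTIGUOUS'])     # stored as one contiguous C-style block
print(list_container_bytes)             # size of the list of references only
```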

Let's further see the C type behaviour of Numpy


• For this, let's pass a floating point number as one of the values in the np array
arr4 = np.array([1, 2, 3, 4])
arr4

array([1, 2, 3, 4])

arr4 = np.array([1, 2, 3, 4.0])


arr4

array([1., 2., 3., 4.])

• Notice that the int values are upcast to float


• Because a single C array can store values of only one data type, i.e. homogeneous data
• If you press "Shift+Tab" inside the np.array() call (in Jupyter/Colab)

• You can see the function's signature


– name
– input parameters
– default values of input parameters
• Look at dtype=None
– dtype means data-type
– which is set to None by default

What if we set dtype to float?


arr5 = np.array([1, 2, 3, 4])
arr5

array([1, 2, 3, 4])

arr5 = np.array([1, 2, 3, 4], dtype="float")


arr5

array([1., 2., 3., 4.])
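The dtype behaviour above can be checked programmatically. This short sketch uses dtype.kind ('i' for signed integers, 'f' for floats) so that it does not depend on the platform's default integer width; astype() is shown as one way to convert an existing array.

```python
import numpy as np

ints = np.array([1, 2, 3, 4])
mixed = np.array([1, 2, 3, 4.0])                 # one float upcasts every element
forced = np.array([1, 2, 3, 4], dtype="float")   # dtype requested explicitly
converted = ints.astype(float)                   # astype returns a new array

print(ints.dtype.kind, mixed.dtype.kind, forced.dtype.kind, converted.dtype.kind)
print(ints.dtype.kind)   # the original array keeps its integer dtype
```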


Conclusion:
• "nd" in ndarray stands for n-dimensional - ndarray means an n-dimensional array

Indexing and Slicing upon Numpy arrays


m1 = np.arange(12)
m1

array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

Indexing in np arrays
• Works same as lists
m1[0] # gives first element of array

m1[12] # IndexError: index 12 is out of bounds for axis 0 with size 12

Question: What will be the output of m1[-1] ?


m1[-1]

11

Numpy also supports negative indexing.

You can also index a numpy array with a list of indexes


m1 = np.array([100,200,300,400,500,600])

m1[[2,3,4,1,2,2]]

array([300, 400, 500, 200, 300, 300])

Did you notice how a single index can be repeated multiple times when giving a list of indexes?

Slicing
• Similar to Python lists
• We can slice out and get a part of np array
• Can also mix Indexing and Slicing
m1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1

array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

m1[:5]
array([1, 2, 3, 4, 5])

Question: What'll be the output of m1[-5:-1]?


m1[-5:-1]

array([6, 7, 8, 9])

Question: What'll be the output of m1[-5:-1:-1]?


m1[-5: -1: -1]

array([], dtype=int64)
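The empty result comes from the direction of travel: with a step of -1 the slice walks backwards, so the start must lie to the right of the stop. A minimal sketch of the three cases (variable names are illustrative):

```python
import numpy as np

m1 = np.arange(1, 11)            # values 1 to 10

empty = m1[-5:-1:-1]             # start is left of stop but the step goes backwards
tail_reversed = m1[-1:-5:-1]     # start at the last element and walk backwards
whole_reversed = m1[::-1]        # reverse the entire array

print(empty.size)                # 0
print(tail_reversed)             # values 10, 9, 8, 7
print(whole_reversed[:3])        # values 10, 9, 8
```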

Fancy indexing (Masking)


• Numpy arrays can be indexed with boolean arrays (masks).
• Boolean masking is one form of fancy (advanced) indexing; indexing with a list of indexes, as seen above, is the other.

What would happen if we do this?


m1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1 < 6

array([ True, True, True, True, True, False, False, False, False,
False])

• The comparison is also applied to each element

• Every value less than 6 gives True; 6 and every value above it give False

Now, let's use this to filter or mask values from our array
• A condition is passed instead of indices or slice ranges
m1[m1 < 6]

array([1, 2, 3, 4, 5])

Notice that,

• Value corresponding to True is retained


• Value corresponding to False is filtered out

This is similar to filtering a list with the built-in filter() function


filter(lambda x: x < 6, [...]) (refer to the notes)
How can we filter/mask even values from our array?
m1[m1%2 == 0]

array([ 2, 4, 6, 8, 10])
m1[m1%2==0].shape

(5,)

Question: Multiple conditions in numpy

Given an array of elements from 0 to 10, filter the elements which are
multiple of 2 or 5.

a = [0,1,2,3,4,5,6,7,8,9,10]

output should be [0,2,4,5,6,8,10]

a = np.arange(11)

a[(a %2 == 0) | (a%5 == 0)]

array([ 0, 2, 4, 5, 6, 8, 10])

(Optional) Why do we use & and | instead of the and / or keywords for writing multiple conditions?
The difference is that

• and and or gauge the truth of the whole object, whereas

• & and | are bitwise operators and operate on each bit

Recall that everything is treated as an object in Python.

So, when we use and or or,

• Python treats each object as a single Boolean entity.


bool(42)

True

bool(0)

False

bool(42 or 0)

True

bool(42 and 0)

False

Now, when we apply & and |, Python performs a bitwise AND and OR instead of operating on the whole object.
bin(42)

'0b101010'

bin(50)

'0b110010'

bin(42 & 50)

'0b100010'

bin(42 | 50)

'0b111010'

Notice that the bits of the objects are compared to get the result.

In a similar fashion, you can think of a numpy array with boolean values as a string of bits

• where 1 = True
• and 0 = False
import numpy as np

arr = np.array([1, 0, 1, 0, 1, 0], dtype = bool)


arr1 = np.array([1, 1, 0, 0, 1, 0], dtype =bool)

arr

array([ True, False, True, False, True, False])

arr1

array([ True, True, False, False, True, False])

arr | arr1

array([ True, True, True, False, True, False])

Using and or or on arrays tries to evaluate the truth of the entire array, which is not defined

(as numpy is made for element-wise operations)

arr and arr1

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-53-2ae36cd9a0b9> in <module>
----> 1 arr and arr1
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

(Optional) Now, What is the dtype of mask?


It is a boolean array. Hence, it can be treated as a string of bits, which is why we use the & and | operators on it

(Optional) But why do we use () when using multiple conditions?


Remember that & and | have higher precedence than >, <, ==.

Let's take an example:


a %2 == 0 | a%5 == 0

In the above mask, Python evaluates 0 | a%5 first, which turns the expression into the chained comparison a%2 == (0 | a%5) == 0 and raises a ValueError.
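The effect of the parentheses can be verified with a small experiment. The sketch below combines two masks with | and &, then shows that the unparenthesized form fails; the variable names are illustrative.

```python
import numpy as np

a = np.arange(11)

# Parentheses force each comparison to run before | or & combines the masks
either = a[(a % 2 == 0) | (a % 5 == 0)]   # multiples of 2 or 5
both = a[(a % 2 == 0) & (a % 5 == 0)]     # multiples of both 2 and 5
print(either)   # [ 0  2  4  5  6  8 10]
print(both)     # [ 0 10]

# Without parentheses the expression becomes a chained comparison,
# which needs the truth value of a whole array and therefore fails
try:
    a[a % 2 == 0 | a % 5 == 0]
    raised = False
except ValueError:
    raised = True
print(raised)   # True
```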

Operations on Numpy Arrays


We have already seen operations between a Numpy array and a scalar (single value)

arr = np.arange(4)
arr

array([0, 1, 2, 3])

arr + 3

array([3, 4, 5, 6])

Let's see some algebraic operations on two arrays


# Corresponding elements of arrays get added
a = np.array([1, 2, 3])
b = np.array([2, 2, 2])
a + b

array([3, 4, 5])

# Corresponding elements of arrays get multiplied


a * b

array([2, 4, 6])

Question: What will be the output of the following ?


a = np.array([0,2,3])
b = np.array([1,3,5])

a*b
array([ 0, 6, 15])

Numpy will do element wise multiplication

Aggregate / Universal Functions on 1D array (ufunc)


Numpy provides various universal functions that cover a wide variety of operations.

For example:
• When a constant is added to an array element-wise using the + operator, np.add() is called internally.
import numpy as np

a = np.array([1,2,3,4])

a+2 # ufunc `np.add()` called automatically

array([3, 4, 5, 6])

np.add(a,2)

array([3, 4, 5, 6])

• These functions operate on ndarray (N-dimensional array), i.e. Numpy's array class.

• They perform fast element-wise array operations.
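To make the operator/ufunc correspondence concrete, here is a small sketch comparing a few operators with their ufunc counterparts (np.add, np.multiply) and calling np.sqrt directly. The arrays are made up for the illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Operators and their ufunc counterparts give identical results
sum_op, sum_ufunc = a + b, np.add(a, b)
prod_op, prod_ufunc = a * b, np.multiply(a, b)

print(sum_ufunc)                              # [11 22 33 44]
print(prod_ufunc)                             # [ 10  40  90 160]
print(isinstance(np.add, np.ufunc))           # True
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))    # ufuncs also cover math functions
print(roots)                                  # [1. 2. 3.]
```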

Aggregate Functions/ Reduction functions


Now, how would we calculate the sum of the elements of an array?
np.sum()
• It sums all the values in np array
a = np.arange(1, 11)
a

array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

np.sum(a) # sums all the values present in array

55

Now, what if we want to find the average (mean) value of all the elements in an array?
np.mean()
• np.mean() gives mean of all values in np array
np.mean(a)

5.5

Now, we want to find the minimum value in the array


np.min() function can help us with this

array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

np.min(a)

1

We can also find max elements in an array.


np.max() function will give us maximum value in the array

array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

np.max(a) # maximum value

10
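The reduction functions above are also available as array methods, and np.median() covers the median. A compact sketch on the same array (variable names are illustrative):

```python
import numpy as np

a = np.arange(1, 11)        # values 1 to 10

total = a.sum()             # same as np.sum(a)
average = a.mean()          # same as np.mean(a)
middle = np.median(a)       # median has no method form; use the function
lowest, highest = a.min(), a.max()

print(total, average, middle, lowest, highest)
```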

Usecase: NPS
Let's first download the dataset
import numpy as np

!gdown 1c0ClC8SrPwJq5rrkyMKyPn80nyHcFikK

Downloading...
From: https://drive.google.com/uc?id=1c0ClC8SrPwJq5rrkyMKyPn80nyHcFikK
To: /content/survey.txt
100% 2.55k/2.55k [00:00<00:00, 4.80MB/s]

Let's load the data we saw earlier. For this we will use the np.loadtxt() function
Documentation: https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html

score = np.loadtxt('survey.txt', dtype ='int')

We provide the file name along with the dtype of the data we want to load

Let's see what the data looks like


score[:5]

array([ 7, 10, 5, 9, 9])

Let's check the number of responses


score.shape

(1167,)

There are a total of 1167 responses for the LTR survey

Let's perform some sanity checks on the data


Let's check the minimum and maximum values in the array

score.min()

score.max()

10

Looks like there are no records with a score of 0.

Now, let's calculate NPS using these responses.

NPS = % Promoters - % Detractors

Now, in order to calculate NPS, we need to calculate two things:

• % Promoters
• % Detractors

In order to calculate % Promoters and % Detractors, we need the counts of promoters and detractors.

Question: How can we get the count of Promoters/Detractors?


We can do so by using fancy indexing (masking)

Let's get the counts of promoters and detractors


Detractors have a score <=6

detractors = score[score <= 6].shape[0]

total = score.shape[0]

percent_detractors = detractors/total*100

percent_detractors
28.449014567266495

Similarly, Promoters have a score 9-10

promoters = score[score >= 9].shape[0]

percent_promoters = promoters/total*100

percent_promoters

52.185089974293064

Calculating NPS
To calculate NPS, we compute
% promoters - % detractors

nps = percent_promoters - percent_detractors


nps

23.73607540702657

np.round(nps)

24.0
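The steps above can be folded into one small helper. This is a sketch, not part of the original notebook: compute_nps is a hypothetical name, and it relies on the fact that the mean of a boolean mask is the fraction of True values. The sample input is the first five scores shown earlier.

```python
import numpy as np

def compute_nps(scores):
    """Return % promoters minus % detractors for 0-10 survey scores."""
    scores = np.asarray(scores)
    pct_promoters = np.mean(scores >= 9) * 100    # share of scores 9-10
    pct_detractors = np.mean(scores <= 6) * 100   # share of scores 0-6
    return pct_promoters - pct_detractors

sample = np.array([7, 10, 5, 9, 9])   # 3 promoters, 1 detractor, 1 passive
sample_nps = compute_nps(sample)
print(round(sample_nps))
```

Calling compute_nps(score) on the full survey array should reproduce the value computed step by step above.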

Now, there are two types of data:

• Numerical (we have seen so far)


• Categorical (in form of categories)

For example:

• An array of Blood pressure status for a patient on various days.


• It can have values 'Low', 'Good', 'High'

Similarly,

• An array of workout data which contains muscle area impacted from muscle training
• Values can be core, legs, shoulder, back

Similarly, we will map our scores into 3 categories such that:

• 0 - 6: Detractors

• 7 - 8: Passive

• 9 - 10: Promoters
This process is called binning
But why binning?
Binning helps us reduce the number of unique values.

• simplifies the data without any significant loss of information


• helps in quick absorption of information
• also helps in visualization (will be discussed later)
• also helps in simplifying inputs to ML models (hence, reducing computational complexity)

How'll we bin our data ?

Will this work ?


score[score <= 6] = 'Detractors'

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-86-b1629d66b9e5> in <module>
----> 1 score[score <= 6] = 'Detractors'

ValueError: invalid literal for int() with base 10: 'Detractors'

Why didn't the above code work?


Recall that arrays have a homogeneous datatype

• dtype of our array is int

We are trying to assign a string to an int array; hence, it throws an error

So, what do we do?


What if we create an array of the same length as the score array and assign values to the new array based on the values present in the score array?

How do we initialize a new array based on the length of a preexisting array?


Numpy provides a function to create an uninitialized ("empty") array: np.empty()

It takes the following arguments:

• shape
• dtype

Question: What will be the shape and dtype of new array ?


arr = np.empty(shape = score.shape, dtype = 'str')

arr

array(['', '', '', ..., '', '', ''], dtype='<U1')


Notice the following

• All the elements of the array show up as empty strings (np.empty() does not initialize the memory, so the initial contents are not guaranteed)


• But the dtype is shown as <U1.

Didn't we initialize the dtype as string?

Why is the dtype being shown as <U1 ?


U1 means Unicode string of length 1.

Whenever we initialize an array with the str datatype, numpy creates it as a Unicode string type of length 1.

Question: What will happen in following case? Will the string be assigned to the 0th
index ?
arr[0] = 'hellow'

arr

array(['h', '', '', ..., '', '', ''], dtype='<U1')

Notice that,

• as the length is defined as 1


• numpy silently truncates the rest of the string and stores only the first character.

But we want to store the whole string: 'detractors', 'promoters' or 'passive'.

How do we change the cap on length of string ?


We can specify the length of string while initializing the array.

arr = np.empty(shape = score.shape, dtype = 'U10')

arr

array(['', '', '', ..., '', '', ''], dtype='<U10')

arr.shape

(1167,)

Instead of specifying the dtype as str, we specify it as Un, where n is the maximum number of characters
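The truncation rule applies to any fixed-width string dtype, not only U1. A minimal sketch (array names and the long label are made up) showing that values longer than the declared width are cut without any warning:

```python
import numpy as np

narrow = np.empty(3, dtype="U1")    # room for 1 character per element
narrow[0] = "hello"                 # silently truncated to 'h'

wide = np.empty(3, dtype="U10")     # room for 10 characters per element
wide[0] = "detractors"              # exactly 10 characters, stored in full
wide[1] = "a much longer label"     # cut to the first 10 characters

print(narrow[0])
print(wide[0])
print(wide[1], len(wide[1]))
```

So the width should be chosen from the longest label that needs to be stored.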

Now, we have got a string array. Let's bin our score values

arr[score <= 6] = 'detractors'

arr

array(['', '', 'detractors', ..., 'detractors', '', ''], dtype='<U10')


Similarly, we can do it for passive and promoters

arr[(score >= 7) & (score <= 8)] = 'passive'

arr[score >= 9] = 'promoters'

arr

array(['passive', 'promoters', 'detractors', ..., 'detractors',
       'promoters', 'promoters'], dtype='<U10')

arr[:15]

array(['passive', 'promoters', 'detractors', 'promoters', 'promoters',
       'detractors', 'passive', 'promoters', 'promoters', 'promoters',
       'promoters', 'detractors', 'promoters', 'promoters', 'passive'],
      dtype='<U10')
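As an aside, the same binning can be written without pre-allocating a string array by nesting np.where(). This is an alternative technique, not the approach used in these notes, and the scores below are made up for the illustration.

```python
import numpy as np

# Hypothetical scores, not taken from survey.txt
scores = np.array([7, 10, 5, 9, 9, 0, 8])

# np.where(condition, value_if_true, value_if_false), nested for three bins
labels = np.where(scores <= 6, "detractors",
                  np.where(scores <= 8, "passive", "promoters"))

print(labels)
```

NumPy picks a string width large enough for the longest label, so no explicit Un dtype is needed here.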

Now, we have array with desired values.

How do we count the number of instances of each value?


There are two ways of doing it.

Let's look at the long way first.

We do fancy indexing for each unique value and read the count from the shape

detractors_count = arr[arr == 'detractors'].shape[0]

detractors_count

332

passive_count = arr[arr == 'passive'].shape[0]


passive_count

226

promoters_count = arr[arr == 'promoters'].shape[0]


promoters_count

609

Now, there's a short way as well.

Numpy provides the function np.unique() to get the unique elements

np.unique(arr)

array(['detractors', 'passive', 'promoters'], dtype='<U10')


But we want the count of each unique element.

For this, we can pass argument return_counts = True

np.unique(arr, return_counts = True)

(array(['detractors', 'passive', 'promoters'], dtype='<U10'),
 array([332, 226, 609]))

unique, counts = np.unique(arr, return_counts = True)

unique

array(['detractors', 'passive', 'promoters'], dtype='<U10')

counts

array([332, 226, 609])
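np.unique() returns the labels in sorted order, so positions such as counts[2] depend on that ordering. One way to avoid relying on positions is to pair labels with counts in a dictionary. The sketch below uses a small made-up label array and illustrative variable names.

```python
import numpy as np

# Hypothetical labels, standing in for the binned survey array
labels = np.array(["promoters", "passive", "detractors", "promoters", "promoters"])

values, counts = np.unique(labels, return_counts=True)

# Map each label to its count so lookups are by name, not by position
count_of = dict(zip(values.tolist(), counts.tolist()))
pct_promoters = count_of["promoters"] / labels.size * 100

print(count_of)
print(pct_promoters)
```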

Now, let's calculate the percent of promoters and detractors

% Promoters
percent_promoters = counts[2]/counts.sum()*100

% Detractors
percent_detractors = counts[0]/counts.sum()*100

Calculating NPS
To calculate NPS, we compute
% promoters - % detractors

nps = percent_promoters - percent_detractors


nps

23.73607540702657

np.round(nps)

24.0

(Optional) What is a good NPS score ?

Source: https://chattermill.com/blog/what-is-a-good-nps-score/

(Optional) Industry wise NPS benchmark


Use Case: Fitbit
Imagine you are a Data Scientist at Fitbit
You've been given user data to analyse and find insights which can be shown on the smart watch.

But why would we want to analyse the user data when designing the watch?
These insights from the user data can help the business make customer-oriented decisions for the product design.

Let's first look at the data we have gathered


Link: https://drive.google.com/file/d/1Uxwd4H-tfM64giRS1VExMpQXKtBBtuP0/view?usp=sharing

Notice that there are some user features in the data


These are provided as various columns in the data.

Every row is called a record or data point

What are all the features provided to us?


• Date
• Step Count
• Mood (Categorical)
• Calories Burned
• Hours of sleep
• Feeling Active (Categorical)

Using NumPy, we will explore this data to look for some interesting insights - Exploratory
Data Analysis.

EDA is all about asking the right questions

What kind of questions can we answer using this data?


• How many records and features are there in the dataset?
• What is the average step count?
• On which day was the step count highest/lowest?

Can we find some deeper insights?


We can probably see how daily activity affects sleep and mood.

We will try finding


• How daily activity affects mood?
import numpy as np

Working with 2-D arrays (Matrices)


Question : How do we create a matrix using numpy?
m1 = np.array([[1,2,3],[4,5,6]])
m1
# Nicely printing out in a Matrix form

array([[1, 2, 3],
[4, 5, 6]])

How can we check shape of a numpy array?

m1.shape # m1 has 2 rows and 3 columns

(2, 3)

Question: What is the type of the result of m1.shape? Which data structure is this?
Tuple

Now, what is the dimension of this array?


m1.ndim

2

Question

a = np.array([[1,2,3],
[4,5,6],
[7,8,9]])

b = len(a)

What'll be the value of b?

Ans: 3

Explanation: len() of an n-D array gives the size of its first dimension

a = np.array([[1,2,3],
[4,5,6],
[7,8,9]])

a
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

len(a)

3

What will be the shape of array a?


a.shape

(3, 3)

• So, it is a 2-D array with 3 rows and 3 columns

Clearly, if we have to create high-dimensional arrays, we cannot do this using np.arange() directly

How can we create high dimensional arrays?


• Using reshape()

For a 2D array

• First argument is no. of rows


• Second argument is no. of columns
m2 = np.arange(1, 13)
m2

array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

• We can pass the desired dimensions of array in reshape()

In what ways can we convert this array with 12 values into high-dimensional array?

Can we make m2 a 4 × 4 array?


• Obviously NO
• 4 × 4 requires 16 values, but we only have 12 in m2
m2 = np.arange(1, 13)
m2.reshape(4, 4)

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-122-fc70b006b379> in <module>
      1 m2 = np.arange(1, 13)
----> 2 m2.reshape(4, 4)
ValueError: cannot reshape array of size 12 into shape (4,4)

So, What are the ways in which we can reshape it?


• 4 ×3
• 3×4
• 6 ×2
• 2 ×6
• 1 ×12
• 12 ×1
m2 = np.arange(1, 13)
m2.reshape(4, 3)

array([[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9],
[10, 11, 12]])

m2 = np.arange(1, 13)
m2

array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

m2.shape

(12,)

Let's do some reshaping here


m2.reshape(12, 1)

array([[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 9],
[10],
[11],
[12]])

Now, what's the difference between (12,) and (12, 1)?


• (12,) means it's a 1D array with 12 elements
• (12, 1) means it's a 2D array with 12 rows and 1 column
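The difference shows up in ndim and in how elements are indexed. The sketch below also uses -1 as one of the reshape() arguments, which asks NumPy to infer that dimension from the total number of elements; variable names are illustrative.

```python
import numpy as np

m2 = np.arange(1, 13)           # shape (12,): a 1D array

column = m2.reshape(12, 1)      # shape (12, 1): 2D, 12 rows and 1 column
inferred = m2.reshape(3, -1)    # -1 is inferred as 4 because 12 / 3 = 4

print(m2.ndim, column.ndim)     # 1 2
print(m2[5], column[5, 0])      # same value, but the 2D array needs two indexes
print(inferred.shape)           # (3, 4)
```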
Question
What will be output for the following code?

a = np.array([[1,2,3],[0,1,4]])
print(a.ndim)

Ans: 2

a = np.array([[1,2,3],[0,1,4]])
print(a.ndim)

Since it is a 2 dimensional array, the number of dimension will be 2.

Transpose
• Change rows into columns and columns into rows

• Just use <Matrix>.T

a = np.arange(3)
a

array([0, 1, 2])

a.T

array([0, 1, 2])

Why did the transpose not work?


• Because numpy sees a as a vector (3,), NOT a matrix

• We'll have to reshape the vector a to make it a matrix

a = np.arange(3).reshape(1, 3)
a
# Now a has dimensions (1, 3) instead of just (3,)
# It has 1 row and 3 columns

array([[0, 1, 2]])

a.T
# It has 3 rows and 1 column

array([[0],
[1],
[2]])
Conclusion
• Transposing a 1D array returns it unchanged; the array needs at least 2 dimensions (a matrix) for .T to swap rows and columns
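A short sketch of how .T behaves for 1D and 2D arrays, checking the shapes and one element to confirm that rows and columns are swapped (variable names are illustrative):

```python
import numpy as np

vector = np.arange(3)              # shape (3,)
row = vector.reshape(1, 3)         # shape (1, 3)
matrix = np.arange(6).reshape(2, 3)

print(vector.T.shape)              # (3,): nothing to swap in 1D
print(row.T.shape)                 # (3, 1)
print(matrix.T.shape)              # (3, 2)
print(matrix[1, 2], matrix.T[2, 1])  # the same element, indexes swapped
```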

Flattening of an array
What if we want to convert this 2D or nD array back to 1D array?
There is a function named flatten() to help you do so.

A = np.arange(12).reshape(3, 4)
A

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

A.flatten()

array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
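One detail worth knowing is that flatten() returns a copy, so changing the flattened array leaves the original matrix untouched. A minimal sketch:

```python
import numpy as np

A = np.arange(12).reshape(3, 4)
flat = A.flatten()          # new 1D array with its own data

flat[0] = 99                # modify the copy only

print(flat[:4])
print(A[0, 0])              # still 0: the original is unchanged
print(flat.shape, A.shape)
```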

Indexing and Slicing on 2D Numpy arrays


Indexing in np arrays
• Works same as lists
m1 = np.arange(1,10).reshape((3,3))

m1

array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

m1[1][2]

6

OR

• We can use [row, column] (indexes separated by commas), e.g. m1[1, 2]

What will be the output of this?


m1[1, 1] #m1[row, column]

5

We saw how we can use list of indexes in numpy array


m1 = np.array([100,200,300,400,500,600])
m1[[2,3,4,1,2,2]]

array([300, 400, 500, 200, 300, 300])

How'll list of indexes work in 2D array ?

m1 = np.arange(9).reshape((3,3))

m1

array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])

m1[[0,1,2],[0,1,2]] # picking up element (0,0), (1,1) and (2,2)

array([0, 4, 8])
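The two index lists are paired up position by position: the first list supplies row indexes and the second supplies column indexes. A small sketch with other pairings (variable names are illustrative):

```python
import numpy as np

m1 = np.arange(9).reshape(3, 3)

pairs = m1[[0, 2], [1, 2]]               # elements (0, 1) and (2, 2)
anti_diagonal = m1[[0, 1, 2], [2, 1, 0]] # elements (0, 2), (1, 1), (2, 0)
rows_only = m1[[0, 2]]                   # a single list selects whole rows

print(pairs)            # values 1 and 8
print(anti_diagonal)    # values 2, 4, 6
print(rows_only.shape)  # (2, 3)
```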

Slicing
• Need to provide two slice ranges - one for row and one for column
• Can also mix Indexing and Slicing
m1 = np.arange(12).reshape(3,4)
m1

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

m1[:2] # gives first two rows

array([[0, 1, 2, 3],
[4, 5, 6, 7]])

How can we get columns from 2D array?


m1[:, :2] # gives first two columns

array([[0, 1],
[4, 5],
[8, 9]])

Question: Given a 2-D array

m1 = np.array([[0,1,2,3],
[4,5,6,7],
[8,9,10,11]])

Question for you: Can you get just this part of our array m1?
[[5, 6],
[9, 10]]

Remember our m1 is:


m1 = [[0, 1, 2, 3],
[4, 5, 6, 7],
[8, 9, 10, 11]]

# First get rows 1 to all


# Then get columns 1 to 3 (not included)
m1[1:, 1:3]

array([[ 5, 6],
[ 9, 10]])

Question: What if I need the columns at index 1 and 3?


[[1, 3],
[5, 7],
[9,11]]

# Get all rows


# Then get columns from 1 to all with step of 2

m1[:, 1::2]

array([[ 1, 3],
[ 5, 7],
[ 9, 11]])

• We can also pass indices of required columns as a Tuple to get the same result
# Get all rows
# Then get columns 1 and 3

m1[:, (1,3)]

array([[ 1, 3],
[ 5, 7],
[ 9, 11]])

Fancy indexing (Masking)

What would happen if we do this?


m1 = np.arange(12).reshape(3, 4)
m1 < 6
array([[ True, True, True, True],
[ True, True, False, False],
[False, False, False, False]])

• A matrix having boolean values True and False is returned

• We can use this boolean matrix to filter our array

Now, Let's use this to filter or mask values from our array
• Condition will be passed instead of indices and slice ranges
m1[m1 < 6]
# Value corresponding to True is retained
# Value corresponding to False is filtered out

array([0, 1, 2, 3, 4, 5])

How can we filter/mask even values from our array?


m1[m1%2 == 0]

array([ 0, 2, 4, 6, 8, 10])

But did you notice that matrix gets converted into a 1D array after masking?
m1

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

m1[m1%2 == 0]

array([ 0, 2, 4, 6, 8, 10])

It happens because
• To retain the matrix shape, it has to retain all the elements
• It cannot keep its 3 × 4 shape with fewer elements
• So, this filtering operation implicitly converts a high-dimensional array into a 1D array

If we want, we can reshape the resulting 1D array into 2D


• But we need to know beforehand the number of elements in the resulting 1D array
m1[m1%2==0].shape

(6,)

m1[m1%2==0].reshape(2, 3)
array([[ 0, 2, 4],
[ 6, 8, 10]])
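If the goal is to keep the original shape, one option is to assign through the mask instead of selecting with it, as was done when binning the scores. The sketch below replaces odd values with -1 on a copy; the placeholder value -1 is an arbitrary choice for the illustration.

```python
import numpy as np

m1 = np.arange(12).reshape(3, 4)

evens = m1[m1 % 2 == 0]          # selecting with a mask flattens to 1D

masked = m1.copy()               # work on a copy so m1 stays intact
masked[masked % 2 != 0] = -1     # assigning through a mask keeps the shape

print(evens.shape)               # (6,)
print(masked.shape)              # (3, 4)
print(masked)
```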

Universal Functions (ufunc) on 2D & Axis


Aggregate Functions/ Reduction functions
We saw earlier how aggregate functions work on a 1D array

arr = np.arange(3)
arr

array([0, 1, 2])

arr.sum()

3

Let's apply Aggregate functions on 2D array


np.sum()
a = np.arange(12).reshape(3, 4)
a

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

np.sum(a) # sums all the values present in array

66

What if we want to sum the elements row-wise or column-wise?


• By setting axis parameter

What will np.sum(a, axis=0) do?


• np.sum(a, axis=0) adds together values in DIFFERENT rows
• axis = 0 ---> Changes will happen along the vertical axis
• Summing of values happen in the vertical direction
• Rows collapse/merge when we do axis=0
np.sum(a, axis=0)

array([12, 15, 18, 21])

Now, What if we specify axis=1?


• np.sum(a, axis=1) adds together values in DIFFERENT columns
• axis = 1 ---> Changes will happen along the horizontal axis
• Summing of values happen in the horizontal direction
• Columns collapse/merge when we do axis=1
np.sum(a, axis=1)

array([ 6, 22, 38])
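A quick way to remember what axis does is to look at the shape of the result: the axis you pass is the one that disappears. A minimal sketch with the same 3 × 4 array (col_totals and row_totals are illustrative names):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

col_totals = np.sum(a, axis=0)  # collapses the 3 rows -> one total per column
row_totals = np.sum(a, axis=1)  # collapses the 4 columns -> one total per row

print(a.shape, col_totals.shape, row_totals.shape)
print(col_totals)
print(row_totals)
```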

Now, what if we want to find the average (mean) value of all the elements
in an array?
np.mean(a) # no need to give any axis

5.5

What if we want to find the mean of elements in each row or in each column?
• We can do same thing with axis parameter like we did for np.sum() function

Question: Now you tell What will np.mean(a, axis=0) give?


• It will give mean of values in DIFFERENT rows
• axis = 0 ---> Changes will happen along the vertical axis
• Mean of values will be calculated in the vertical direction
• Rows collapse/merge when we do axis=0
np.mean(a, axis=0)

array([4., 5., 6., 7.])

How can we get the mean of elements in each row?


• np.mean(a, axis=1) will give mean of values in DIFFERENT columns
• axis = 1 ---> Changes will happen along the horizontal axis
• Mean of values will be calculated in the horizontal direction
• Columns collapse/merge when we do axis=1
np.mean(a, axis=1)

array([1.5, 5.5, 9.5])

Now, we want to find the minimum value in the array


np.min() function can help us with this

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

np.min(a)
0

What if we want to find row wise minimum value?

Use axis argument!!


np.min(a, axis = 1 )

array([0, 4, 8])

We can also find max elements in an array.


np.max() function will give us maximum value in the array

We can also use axis argument to find row wise/ column wise max.

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

np.max(a) # maximum value

11

np.max(a, axis = 0) # column wise max

array([ 8, 9, 10, 11])

Logical Operations
Now, What if we want to check whether "any" element of array follows a specific
condition?

Let's say we have 2 arrays:


a = np.array([1,2,3,4])
b = np.array([4,3,2,1])
a, b

(array([1, 2, 3, 4]), array([4, 3, 2, 1]))

Let's say we want to find out if any element in array a is smaller than the
corresponding element in array b

np.any() comes in handy here


• np.any() returns True if at least one element of the boolean array passed to it is
True, that is, if at least one pair of corresponding elements satisfies the condition.
a = np.array([1,2,3,4])
b = np.array([4,3,2,1])
np.any(a<b) # Atleast 1 element in a < corresponding element in b

True

Let's try the same condition with different arrays:


a = np.array([4,5,6,7])
b = np.array([4,3,2,1])
np.any(a<b) # All elements in a >= corresponding elements in b

False

• In this case, NONE of the elements in a were smaller than their corresponding
elements in b

• So, np.any(a<b) returned False

np.all()

Now, what if we want to check whether "all" the elements in our array follow a
specific condition?

Let's say we want to find out if every element in array a is smaller than the
corresponding element in array b
Again, Let's say we have 2 arrays:

a = np.array([1,2,3,4])
b = np.array([4,3,2,1])
a, b

(array([1, 2, 3, 4]), array([4, 3, 2, 1]))

np.all(a<b) # Not all elements in a < corresponding elements in b

False

Let's try it with different arrays


a = np.array([1,0,0,0])
b = np.array([4,3,2,1])
np.all(a<b) # All elements in a < corresponding elements in b

True
• In this case, ALL the elements in a were smaller than their corresponding
elements in b

• So, np.all(a<b) returned True

Multiple conditions for .all() function


a = np.array([1, 2, 3, 2])
b = np.array([2, 2, 3, 2])
c = np.array([6, 4, 4, 5])
((a <= b) & (b <= c)).all()

True
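It can help to look at the intermediate boolean array that np.any() and np.all() reduce. The sketch below reuses the first pair of arrays from this section; np.count_nonzero is used here only to count how many positions satisfy the condition.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([4, 3, 2, 1])

mask = a < b                   # element-wise comparison -> boolean array

print(mask)
print(np.any(mask))            # is at least one entry True?
print(np.all(mask))            # is every entry True?
print(np.count_nonzero(mask))  # how many entries are True
```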

What if we want to update an array based on condition ?


Suppose you are given an array of integers and you want to update it based on following
condition:

• if element is > 0, change it to +1


• if element < 0, change it to -1.

How will you do it ?


arr = np.array([-3,4,27,34,-2, 0, -45,-11,4, 0 ])
arr

array([ -3, 4, 27, 34, -2, 0, -45, -11, 4, 0])

You can use masking to update the array (as discussed in last class)

arr[arr > 0] = 1
arr [arr < 0] = -1

arr

array([-1, 1, 1, 1, -1, 0, -1, -1, 1, 0])

There is a numpy function which can help us with it.

np.where()
Function signature: np.where(condition, [x, y])

This function returns an ndarray whose elements are chosen from x where condition is
True and from y where it is False.

arr = np.array([-3,4,27,34,-2, 0, -45,-11,4, 0 ])

np.where(arr > 0, +1, -1)

array([-1, 1, 1, 1, -1, -1, -1, -1, 1, -1])


arr

array([ -3, 4, 27, 34, -2, 0, -45, -11, 4, 0])

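Notice that it didn't change the original array. Also notice that the zeros were mapped to -1 (because 0 > 0 is False), whereas the masking approach left them as 0.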
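If the zeros should stay 0, as in the masking version, one option is to nest a second np.where() inside the first. This is a sketch rather than the only way to do it; np.sign() gives the same result for this particular rule.

```python
import numpy as np

arr = np.array([-3, 4, 27, 34, -2, 0, -45, -11, 4, 0])

# Nested np.where: +1 for positives, -1 for negatives, 0 otherwise
signs = np.where(arr > 0, 1, np.where(arr < 0, -1, 0))

print(signs)

# np.sign computes the same -1 / 0 / +1 mapping directly
print(np.array_equal(signs, np.sign(arr)))
```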

Sorting Arrays
• We can also sort the elements of an array along a given specified axis

• Default axis is the last axis of the array.


np.sort()
a = np.array([2,30,41,7,17,52])
a

array([ 2, 30, 41, 7, 17, 52])

np.sort(a)

array([ 2, 7, 17, 30, 41, 52])

a # the original array is unchanged

array([ 2, 30, 41, 7, 17, 52])

Let's work with 2D array

a = np.arange(9,0,-1).reshape(3,3)
a

array([[9, 8, 7],
[6, 5, 4],
[3, 2, 1]])

Question: What will be the result when we sort using axis = 0 ?


np.sort(a, axis = 0)

array([[3, 2, 1],
[6, 5, 4],
[9, 8, 7]])

Recall that when axis = 0

• change will happen along the vertical axis.

Hence, the values within each column get sorted from top to bottom.

a
array([[9, 8, 7],
[6, 5, 4],
[3, 2, 1]])

• Original array is still the same. It hasn't changed

np.argsort()
• Returns the indices that would sort an array.

• Performs an indirect sort along the given axis.

• It returns an array of indices of the same shape as a that index data along the
given axis in sorted order.

a = np.array([2,30,41,7,17,52])
a

array([ 2, 30, 41, 7, 17, 52])

np.argsort(a)

array([0, 3, 4, 1, 2, 5])

As you can see:


• Each entry is the original index of the element that belongs at that position in
sorted order (index 0 holds the smallest value, index 3 the next smallest, and so on)
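The indices returned by np.argsort() can be used to index the array itself. A minimal sketch with the same array (order, sorted_a and descending are illustrative names):

```python
import numpy as np

a = np.array([2, 30, 41, 7, 17, 52])
order = np.argsort(a)

sorted_a = a[order]          # same values as np.sort(a)
descending = a[order[::-1]]  # reverse the index order for a descending sort

print(order)
print(sorted_a)
print(descending)
```

The same order array could also be used to reorder a second array of the same length, for example dates aligned with step counts.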

Use Case: Fitness data analysis


Let's first download the dataset
!gdown 1vk1Pu0djiYcrdc85yUXZ_Rqq2oZNcohd

Downloading...
From: https://drive.google.com/uc?id=1vk1Pu0djiYcrdc85yUXZ_Rqq2oZNcohd
To: /content/fit.txt
0% 0.00/3.43k [00:00<?, ?B/s] 100% 3.43k/3.43k [00:00<00:00,
6.65MB/s]

Let's load the data we saw earlier. For this we will use .loadtxt()
function
data = np.loadtxt('fit.txt', dtype='str')

We provide file name along with the dtype of data we want to load in

data[:5]
array([['06-10-2017', '5464', 'Neutral', '181', '5', 'Inactive'],
['07-10-2017', '6041', 'Sad', '197', '8', 'Inactive'],
['08-10-2017', '25', 'Sad', '0', '5', 'Inactive'],
['09-10-2017', '5461', 'Sad', '174', '4', 'Inactive'],
['10-10-2017', '6915', 'Neutral', '223', '5', 'Active']],
dtype='<U10')

What's the shape of the data?

data.shape

(96, 6)

There are 96 records and each record has 6 features. These features are:

• Date
• Step count
• Mood
• Calories Burned
• Hours of sleep
• activity status

Notice that the above array is homogeneous: all the data is stored as strings.
In order to work with strings, categorical data and numerical data, we will have to save every
feature separately

How will we extract features into separate variables?


We can get some idea on how data is saved.

Let's see what the first element of data is

data[0]

array(['06-10-2017', '5464', 'Neutral', '181', '5', 'Inactive'],
dtype='<U10')

Hm, this extracts a row not a column

Think about it.

What's the way to change columns to rows and rows to columns?


Transpose

data.T[0]

array(['06-10-2017', '07-10-2017', '08-10-2017', '09-10-2017',
'10-10-2017', '11-10-2017', '12-10-2017', '13-10-2017',
'14-10-2017', '15-10-2017', '16-10-2017', '17-10-2017',
'18-10-2017', '19-10-2017', '20-10-2017', '21-10-2017',
'22-10-2017', '23-10-2017', '24-10-2017', '25-10-2017',
'26-10-2017', '27-10-2017', '28-10-2017', '29-10-2017',
'30-10-2017', '31-10-2017', '01-11-2017', '02-11-2017',
'03-11-2017', '04-11-2017', '05-11-2017', '06-11-2017',
'07-11-2017', '08-11-2017', '09-11-2017', '10-11-2017',
'11-11-2017', '12-11-2017', '13-11-2017', '14-11-2017',
'15-11-2017', '16-11-2017', '17-11-2017', '18-11-2017',
'19-11-2017', '20-11-2017', '21-11-2017', '22-11-2017',
'23-11-2017', '24-11-2017', '25-11-2017', '26-11-2017',
'27-11-2017', '28-11-2017', '29-11-2017', '30-11-2017',
'01-12-2017', '02-12-2017', '03-12-2017', '04-12-2017',
'05-12-2017', '06-12-2017', '07-12-2017', '08-12-2017',
'09-12-2017', '10-12-2017', '11-12-2017', '12-12-2017',
'13-12-2017', '14-12-2017', '15-12-2017', '16-12-2017',
'17-12-2017', '18-12-2017', '19-12-2017', '20-12-2017',
'21-12-2017', '22-12-2017', '23-12-2017', '24-12-2017',
'25-12-2017', '26-12-2017', '27-12-2017', '28-12-2017',
'29-12-2017', '30-12-2017', '31-12-2017', '01-01-2018',
'02-01-2018', '03-01-2018', '04-01-2018', '05-01-2018',
'06-01-2018', '07-01-2018', '08-01-2018', '09-01-2018'],
dtype='<U10')

Great, we could extract first column

Let's extract all the columns and save them in separate variables
date, step_count, mood, calories_burned, hours_of_sleep, activity_status = data.T

step_count

array(['5464', '6041', '25', '5461', '6915', '4545', '4340', '1230',
'61',
'1258', '3148', '4687', '4732', '3519', '1580', '2822', '181',
'3158', '4383', '3881', '4037', '202', '292', '330', '2209',
'4550', '4435', '4779', '1831', '2255', '539', '5464', '6041',
'4068', '4683', '4033', '6314', '614', '3149', '4005', '4880',
'4136', '705', '570', '269', '4275', '5999', '4421', '6930',
'5195', '546', '493', '995', '1163', '6676', '3608', '774',
'1421',
'4064', '2725', '5934', '1867', '3721', '2374', '2909', '1648',
'799', '7102', '3941', '7422', '437', '1231', '1696', '4921',
'221', '6500', '3575', '4061', '651', '753', '518', '5537',
'4108',
'5376', '3066', '177', '36', '299', '1447', '2599', '702',
'133',
'153', '500', '2127', '2203'], dtype='<U10')
step_count.dtype

dtype('<U10')

Notice the data type of step_count and other variables. It's a string type, where U means Unicode
string and 10 means a maximum length of 10 characters (not 10 bytes).

Why? Because we loaded the file with dtype='str', so Numpy stored all the data as strings.

Let's convert the data types of these variables


Step Count

step_count = np.array(step_count, dtype = 'int')


step_count.dtype

dtype('int64')

step_count

array([5464, 6041, 25, 5461, 6915, 4545, 4340, 1230, 61, 1258,
3148,
4687, 4732, 3519, 1580, 2822, 181, 3158, 4383, 3881, 4037,
202,
292, 330, 2209, 4550, 4435, 4779, 1831, 2255, 539, 5464,
6041,
4068, 4683, 4033, 6314, 614, 3149, 4005, 4880, 4136, 705,
570,
269, 4275, 5999, 4421, 6930, 5195, 546, 493, 995, 1163,
6676,
3608, 774, 1421, 4064, 2725, 5934, 1867, 3721, 2374, 2909,
1648,
799, 7102, 3941, 7422, 437, 1231, 1696, 4921, 221, 6500,
3575,
4061, 651, 753, 518, 5537, 4108, 5376, 3066, 177, 36,
299,
1447, 2599, 702, 133, 153, 500, 2127, 2203])

Calories Burned

calories_burned = np.array(calories_burned, dtype = 'int')


calories_burned.dtype

dtype('int64')

Hours of Sleep

hours_of_sleep = np.array(hours_of_sleep, dtype = 'int')


hours_of_sleep.dtype
dtype('int64')
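The same conversion can also be written with the astype() method, which returns a new array of the requested type. The sketch below uses a short hypothetical sample (steps_str) in place of the real column so that it runs without fit.txt.

```python
import numpy as np

# Hypothetical sample standing in for one string column of data.T
steps_str = np.array(['5464', '6041', '25'])

# astype() returns a new integer array; steps_str itself is not modified
steps_int = steps_str.astype(int)

print(steps_str.dtype)   # Unicode string type
print(steps_int.dtype)   # platform integer type
print(steps_int.sum())   # arithmetic now works on the numeric copy
```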

Mood

Mood is a categorical data type. As the name suggests, a categorical variable takes one of two or more
categories as its value.

Let's check the values of mood variable

mood

array(['Neutral', 'Sad', 'Sad', 'Sad', 'Neutral', 'Sad', 'Sad', 'Sad',
'Sad', 'Sad', 'Sad', 'Sad', 'Happy', 'Sad', 'Sad', 'Sad',
'Sad',
'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral',
'Neutral',
'Happy', 'Neutral', 'Happy', 'Happy', 'Happy', 'Happy',
'Happy',
'Happy', 'Happy', 'Neutral', 'Happy', 'Happy', 'Happy',
'Happy',
'Happy', 'Happy', 'Happy', 'Happy', 'Happy', 'Happy',
'Neutral',
'Happy', 'Happy', 'Happy', 'Happy', 'Happy', 'Happy', 'Happy',
'Happy', 'Happy', 'Neutral', 'Sad', 'Happy', 'Happy', 'Happy',
'Happy', 'Happy', 'Happy', 'Happy', 'Sad', 'Neutral',
'Neutral',
'Sad', 'Sad', 'Neutral', 'Neutral', 'Happy', 'Neutral',
'Neutral',
'Sad', 'Neutral', 'Sad', 'Neutral', 'Neutral', 'Sad', 'Sad',
'Sad',
'Sad', 'Happy', 'Neutral', 'Happy', 'Neutral', 'Sad', 'Sad',
'Sad',
'Neutral', 'Neutral', 'Sad', 'Sad', 'Happy', 'Neutral',
'Neutral',
'Happy'], dtype='<U10')

np.unique(mood)

array(['Happy', 'Neutral', 'Sad'], dtype='<U10')

Activity Status

activity_status

array(['Inactive', 'Inactive', 'Inactive', 'Inactive', 'Active',
'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Inactive', 'Inactive', 'Active', 'Inactive', 'Inactive',
'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Active', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Active',
'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Active', 'Active', 'Active', 'Active', 'Active', 'Active',
'Active', 'Active', 'Active', 'Inactive', 'Inactive',
'Inactive',
'Inactive', 'Inactive', 'Inactive', 'Active', 'Active',
'Active',
'Active', 'Active', 'Active', 'Active', 'Active', 'Active',
'Active', 'Active', 'Active', 'Inactive', 'Active', 'Active',
'Inactive', 'Active', 'Active', 'Active', 'Active', 'Active',
'Inactive', 'Active', 'Active', 'Active', 'Active', 'Inactive',
'Inactive', 'Inactive', 'Inactive', 'Active', 'Active',
'Active',
'Active', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Active',
'Inactive', 'Active'], dtype='<U10')

Let's try to get some insights from the data.


What's the average step count?
How can we calculate average? => .mean()

step_count.mean()

2935.9375

The user walks roughly 2,900 steps a day on average.

On which day was the step count highest?


How will we find it?

First we find the index of maximum step count and use that index to get the date.

How'll we find the index? =>

Numpy provides a function np.argmax() which returns the index of maximum value element.

Similarly, we have a function np.argmin() which returns the index of minimum element.

step_count.argmax()

69

Here 69 is the index of maximum step count element.

date[step_count.argmax()]

'14-12-2017'

Let's check the calories burned on that day

calories_burned[step_count.argmax()]

243

Not bad! 243 calories. Let's try to get the number of steps on that day as well

step_count.max()

7422

7k steps!! Sports mode on!

Let's try to compare step counts on bad mood days and good mood days
Average step count on Sad mood days

np.mean(step_count[mood == 'Sad'])

2103.0689655172414

np.sort(step_count[mood == 'Sad'])

array([ 25, 36, 61, 133, 177, 181, 221, 299, 518, 651,
702,
753, 799, 1230, 1258, 1580, 1648, 1696, 2822, 3148, 3519,
3721,
4061, 4340, 4545, 4687, 5461, 6041, 6676])

np.std(step_count[mood == 'Sad'])

2021.2355035376254

Average step count on happy days

np.mean(step_count[mood == 'Happy'])

3392.725

np.sort(step_count[mood == 'Happy'])

array([ 153, 269, 330, 493, 539, 546, 614, 705, 774, 995,
1421,
1831, 1867, 2203, 2255, 2725, 3149, 3608, 4005, 4033, 4064,
4068,
4136, 4275, 4421, 4435, 4550, 4683, 4732, 4779, 4880, 5195,
5376,
5464, 5537, 5934, 5999, 6314, 6930, 7422])

Average step count on sad days - 2103.


Average step count on happy days - 3392

There may be a relation between mood and step count

Let's check the inverse: the mood when the step count was higher or lower
Mood when step count > 4000

np.unique(mood[step_count > 4000], return_counts = True)

(array(['Happy', 'Neutral', 'Sad'], dtype='<U10'), array([22, 9, 7]))

Out of 38 days when step count was more than 4000, user was feeling happy on 22 days.

Mood when step count < 2000

np.unique(mood[step_count < 2000], return_counts = True)

(array(['Happy', 'Neutral', 'Sad'], dtype='<U10'), array([13, 8, 18]))

Out of 39 days, when step count was less than 2000, user was feeling sad on 18 days.

There may be a correlation between Mood and step count
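The per-mood comparison can be done for every category at once by looping over np.unique(mood) and reusing the boolean-mask idea. The sketch below uses a tiny hypothetical sample in place of the real columns, so the numbers it prints are not the dataset's values.

```python
import numpy as np

# Hypothetical mini-sample; in the notebook these would be the full columns
mood = np.array(['Sad', 'Happy', 'Neutral', 'Happy', 'Sad'])
step_count = np.array([1000, 5000, 3000, 7000, 2000])

avg_steps = {}
for m in np.unique(mood):
    # mood == m is a boolean mask that selects the days with this mood
    avg_steps[str(m)] = float(step_count[mood == m].mean())

print(avg_steps)
```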

import numpy as np

Reshape in 2D array
We saw reshape and flatten. What if I want to convert a matrix to a 1D array using
reshape()?

Question: What should I pass in A.reshape() if I want to use it to convert A to a 1D vector?
• (1, 1)? - NO

• It means we only have a single element

• But we don't have a single element

A = np.arange(12).reshape(3,4)

A.reshape(1, 1)

----------------------------------------------------------------------
-----
ValueError Traceback (most recent call
last)
<ipython-input-223-902e5c35e0d3> in <module>
----> 1 A.reshape(1, 1)

ValueError: cannot reshape array of size 12 into shape (1,1)

• So, (1, 12)? - NO

• It will still remain a 2D Matrix with dimensions 1 ×12

A.reshape(1, 12)

array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])

• Correct answer is (12)


• We need a vector of dimension (12,)

• So we need to pass only 1 dimension in reshape()

A.reshape(12)

array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

So, Be careful while using reshape() to convert a Matrix into a 1D vector

What will happen if we pass a negative integer in reshape()?


A.reshape(6, -1)

array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11]])

Surprisingly, it did not give an error


• -1 acts as a placeholder: reshape() figures out on its own what the value in place of
the negative integer should be

• Since no. of elements in our matrix is 12

• And we passed 6 as no. of rows

• It is able to figure out that no. of columns should be 2


Same thing happens with this:

A.reshape(-1, 6)
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11]])
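Two consequences of this behaviour are worth trying out: reshape(-1) on its own flattens the array, and only one dimension can be left for Numpy to infer. A minimal sketch (the flag two_unknowns_ok is an illustrative name):

```python
import numpy as np

A = np.arange(12).reshape(3, 4)

flat = A.reshape(-1)         # single inferred dimension -> 1D vector of shape (12,)
two_rows = A.reshape(2, -1)  # 12 / 2 -> 6 columns are inferred

# Only one dimension may be -1; two unknowns cannot be inferred
try:
    A.reshape(-1, -1)
    two_unknowns_ok = True
except ValueError:
    two_unknowns_ok = False

print(flat.shape, two_rows.shape, two_unknowns_ok)
```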

Matrix multiplication
Question: What will be output of following?
a = np.arange(5)
b = np.ones(5) * 2

a * b

array([0., 2., 4., 6., 8.])

Recall that, if a and b are 1D, * operation will perform elementwise multiplication

Lets try * with 2D arrays


A = np.arange(12).reshape(3, 4)
A

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

B = np.arange(12).reshape(3, 4)
B

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

A * B

array([[ 0, 1, 4, 9],
[ 16, 25, 36, 49],
[ 64, 81, 100, 121]])

Again did element-wise multiplication

For actual Matrix Multiplication, We have a different method/operator


np.matmul()

What is the requirement of dimensions of 2 matrices for Matrix Multiplication?


• Columns of A = Rows of B (a must condition for Matrix Multiplication)

• If A is 3 × 4, B can be 4 × 3... or 4 × (something else)
So, let's reshape B to 4 × 3 instead
B = B.reshape(4, 3)
B

array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])

np.matmul(A, B)

array([[ 42, 48, 54],
[114, 136, 158],
[186, 224, 262]])

• We are getting a 3 ×3 matrix as output

• So, this is doing Matrix Multiplication

There's a direct operator as well for Matrix Multiplication


@

A @ B

array([[ 42, 48, 54],
[114, 136, 158],
[186, 224, 262]])

Question: What will be the dimensions of Matrix Multiplication B @ A?


• 4×4
B @ A

array([[ 20, 23, 26, 29],
[ 56, 68, 80, 92],
[ 92, 113, 134, 155],
[128, 158, 188, 218]])

There is another method in np for doing Matrix Multiplication


np.dot(A, B)

array([[ 42, 48, 54],
[114, 136, 158],
[186, 224, 262]])

Other cases of np.dot()

• It performs the dot product when both inputs are 1D arrays

• It performs plain multiplication when both inputs are scalars.
a= np.array([1,2,3])
b = np.array([1,1,1])

np.dot(a,b) # 1*1 + 2*1 + 3*1 = 6

6

np.dot(4,5)

20

Now, Let's try multiplication of a mix of matrices and vectors


A = np.arange(12).reshape(3, 4) # A is a 3x4 Matrix
A

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

a = np.array([1, 2, 3]) # a has shape (3,); it can be thought of as a row vector
print(a.shape)

(3,)

np.matmul(A, a)

----------------------------------------------------------------------
-----
ValueError Traceback (most recent call
last)
<ipython-input-243-76efef6bd8e9> in <module>
----> 1 np.matmul(A, a)

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 3 is different from 4)

Columns of A (4) ≠ Rows of a (3)

Let's try the reverse

np.matmul(a, A)

array([32, 38, 44, 50])

YES, Columns of a (3) = Rows of A (3)
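The rule at work is that a 1D array is treated as a row when it is on the left of the product and as a column when it is on the right. The sketch below shows both directions; v4 is a hypothetical length-4 vector added here so that the right-hand product is valid.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)   # 3x4 matrix
v4 = np.ones(4, dtype=int)        # hypothetical length-4 vector
v3 = np.array([1, 2, 3])          # the length-3 vector from the text

right = A @ v4   # v4 is treated as a 4x1 column: (3,4) @ (4,) -> (3,)
left = v3 @ A    # v3 is treated as a 1x3 row:    (3,) @ (3,4) -> (4,)

print(right.shape, right)
print(left.shape, left)
```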

Vectorization
• We have already seen vectorization some time ago
Remember doing scalar operations on np arrays?
A * 2

That's vectorization
• Replacing explicit loops with array expressions is commonly referred to as
vectorization.

• Vectorization helps us to perform operations directly on arrays instead of on
individual scalars.

• Operation gets performed on each element of np array

Revisiting the example:


A = np.arange(10)

A * 2

array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])

np.vectorize()
• np.vectorize() defines a vectorized function

• It takes numpy arrays as inputs and returns a single numpy array or a tuple of
numpy arrays.

• The vectorized function evaluates the input arrays element by element, like
the Python map function. It is a convenience wrapper around a Python-level loop, not a performance optimization

Let's plot the graph of y = log(x) (log function) using np.vectorize()


• We will pass in a numpy array, as it can then take a vector/array/list as input

• It will return the vectorized form of math.log() function

import math
import matplotlib.pyplot as plt

x = np.arange(1, 101)

y = np.vectorize(math.log)(x)

plt.plot(x, y)
plt.show()
y

array([0. , 0.69314718, 1.09861229, 1.38629436, 1.60943791,
1.79175947, 1.94591015, 2.07944154, 2.19722458, 2.30258509,
2.39789527, 2.48490665, 2.56494936, 2.63905733, 2.7080502 ,
2.77258872, 2.83321334, 2.89037176, 2.94443898, 2.99573227,
3.04452244, 3.09104245, 3.13549422, 3.17805383, 3.21887582,
3.25809654, 3.29583687, 3.33220451, 3.36729583, 3.40119738,
3.4339872 , 3.4657359 , 3.49650756, 3.52636052, 3.55534806,
3.58351894, 3.61091791, 3.63758616, 3.66356165, 3.68887945,
3.71357207, 3.73766962, 3.76120012, 3.78418963, 3.80666249,
3.8286414 , 3.8501476 , 3.87120101, 3.8918203 , 3.91202301,
3.93182563, 3.95124372, 3.97029191, 3.98898405, 4.00733319,
4.02535169, 4.04305127, 4.06044301, 4.07753744, 4.09434456,
4.11087386, 4.12713439, 4.14313473, 4.15888308, 4.17438727,
4.18965474, 4.20469262, 4.21950771, 4.2341065 , 4.24849524,
4.26267988, 4.27666612, 4.29045944, 4.30406509, 4.31748811,
4.33073334, 4.34380542, 4.35670883, 4.36944785, 4.38202663,
4.39444915, 4.40671925, 4.41884061, 4.4308168 , 4.44265126,
4.4543473 , 4.46590812, 4.47733681, 4.48863637, 4.49980967,
4.51085951, 4.52178858, 4.53259949, 4.54329478, 4.55387689,
4.56434819, 4.57471098, 4.58496748, 4.59511985, 4.60517019])
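For functions that Numpy already provides as ufuncs, the built-in version can be applied to the array directly. A small sketch comparing np.vectorize(math.log) with np.log on the same input (y_vectorized and y_ufunc are illustrative names):

```python
import math
import numpy as np

x = np.arange(1, 101)

y_vectorized = np.vectorize(math.log)(x)  # calls math.log once per element
y_ufunc = np.log(x)                       # Numpy's own element-wise logarithm

# Both approaches give the same values up to floating-point tolerance
print(np.allclose(y_vectorized, y_ufunc))
```

np.vectorize() remains useful for wrapping a custom Python function that has no array equivalent.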

3 Dimensional Arrays
Vectors, Matrix and Tensors
1. Vector ---> 1-Dimensional Array
2. Matrix ---> 2-Dimensional Array
3. Tensor ---> 3 and above Dimensional Array

Tensor is a general term we use


• Tensor can also be less than 3D

• 2D Tensor is called a Matrix

• 1D Tensor is called a Vector

B = np.arange(24).reshape(2, 3, 4)
B

array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],

[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])

Now, What is happening here?

Question: How many dimensions does B have?


• 3

• It's a 3-dimensional tensor

How is reshape(2, 3, 4) working?


• If you see, it is giving 2 matrices

• Each matrix has 3 rows and 4 columns

So, that's how reshape() is interpreted for 3D


• 1st argument gives depth (No. of Matrices)

• 2nd argument gives no. of rows in each depth

• 3rd argument gives no. of columns in each depth

How can I get just the whole of 1st Matrix?


B[0]

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

Question: What value will I get if I do B[0, 0, 0]?


B[0, 0, 0]
0

Question: What value will I get if I do B[1, 1, 1]?


B[1, 1, 1]

# It looks at Matrix 1, that is, the 2nd Matrix (not Matrix 0)
# Then it looks at row 1 of matrix 1
# Then it looks at column 1 of row 1 of matrix 1

17

We can also do slicing in 3 dimensions


• Works same as in 2-D matrices
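A short sketch of 3D slicing on the same B, with one index or slice per axis in the order (matrix, row, column). The names first_rows and block are illustrative.

```python
import numpy as np

B = np.arange(24).reshape(2, 3, 4)

# Row 0 of every matrix: all matrices, row 0, all columns
first_rows = B[:, 0, :]

# From matrix 1: rows 0-1 and columns 1-2
block = B[1, :2, 1:3]

print(first_rows)
print(block)
```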

Use Case: Image Manipulation using Numpy


• By now, you already have an idea that Numpy is an amazing open-source Python
library for data manipulation and scientific computing.

• It is used in the domain of linear algebra, Fourier transforms, matrices, and the
data science field.

• For numerical work, NumPy arrays are much faster than Python lists.

Do you know Numpy can also be used for Image Processing?


• The fundamental idea is that an image can be represented as a Numpy ndarray of pixel values.

• So we can manipulate these arrays and play with images.

• This use case is to give you a broad overview of Numpy for Image Processing.

Make sure the required libraries are imported


import numpy as np
import matplotlib.pyplot as plt

Now, we'll see how we can play with images using Numpy

Opening an Image
• Well, to play with an image, we first need to open it

But, How can we open an image in our code?


• To open an image, we will use the matplotlib library to read and show images.

• We will cover all the functionalities of matplotlib in detail in visualization lecture.

• For this use case, just know that it uses an image module for working with images.

• It offers two useful methods imread() and imshow().


imread() – to read the images

imshow() – to display the images

Now, Let's go ahead and load our image

Drive link for the image:


Download the image fruits.jpg from here:
https://drive.google.com/file/d/1lHPQUi3wdB6HxN-SNJSBQXK7Z0y0wf32/view?usp=sharing

and place it in your current working directory

Let's download the images first


#fruits image
!gdown 17tYTDPBU5hpby9t0kGd7w_-zBsbY7sEd

Downloading...
From: https://drive.google.com/uc?id=17tYTDPBU5hpby9t0kGd7w_-zBsbY7sEd
To: /content/fruits.png
0% 0.00/4.71M [00:00<?, ?B/s] 100% 4.71M/4.71M [00:00<00:00,
35.7MB/s] 100% 4.71M/4.71M [00:00<00:00, 35.6MB/s]

#emma stone image


!gdown 1o-8yqdTM7cfz_mAaNCi2nH0urFu7pcqI

Downloading...
From: https://drive.google.com/uc?id=1o-8yqdTM7cfz_mAaNCi2nH0urFu7pcqI
To: /content/emma_stone.jpeg
0% 0.00/80.3k [00:00<?, ?B/s] 100% 80.3k/80.3k [00:00<00:00,
78.6MB/s]

img = np.array(plt.imread('fruits.png'))
plt.imshow(img)

<matplotlib.image.AxesImage at 0x7fcc5402a1c0>
Details of an Image
What do you think are the dimensions and shape of this image?
We will check the number of dimensions and the shape of this image using the ndim and
shape attributes of the img array.

print('# of dims: ',img.ndim) # dimension of an image


print('Img shape: ',img.shape) # shape of an image

# of dims: 3
Img shape: (1333, 2000, 3)

How come our 2-D image has 3 dimensions?


• Coloured images have a 3rd dimension for depth or RGB colour channel

• Here, the depth is 3

• But we will come to what RGB colour channels are in a bit

First, Let's understand something peculiar happening here with the shape of image

Do you see something different happening here when we check the shape of image?
• When we discussed 3-D Arrays, we saw that depth was the first element of the
shape tuple

• But when we are loading an image using matplotlib and getting its 3-D array, we
see that depth is the last element of the shape tuple
Why is there a difference between the np array we built earlier and the np array generated by
Matplotlib in terms of where the depth part of the shape appears?
• This is how matplotlib reads the image

• It reads the depth values (R, G and B values) of each pixel one by one and stacks
them one after the other

The shape of the image we read is: (1333, 2000, 3)


• matplotlib first reads that each plane has 1333 ×2000 pixels

• Then, it reads depth values (R, G and B values) of each pixel and place the
values in 3 separate planes

• That is why depth is the last element of shape tuple in np array generated from
an image read by matplotlib

• Whereas in a normal np array, depth is the first element of shape tuple

Now, What are these RGB channels and How can we visualize them?

Visualizing RGB Channels


We can split the image into each RGB color channels using only Numpy

But, what exactly are RGB values?


• These are the values of each pixel of an image

• Each pixel is made up of 3 components/channels - Red, Green, Blue - which form its RGB values

• Coloured images are usually stored as 3-dimensional arrays of 8-bit unsigned integers

• So, the range of values that each channel of a pixel can take is 0 to 2^8 − 1

• That is, each of a pixel's channels, R, G and B, can range from 0 to 255

Each pixel has these 3 values which combined together forms the colour that the
pixel represents
• So, a pixel [255, 0, 0 ] will be RED in colour

• A pixel [0, 255, 0] will be GREEN in colour

• A pixel [0, 0, 255] will be BLUE in colour

Question: What will be the colour of pixel [0, 0, 0]?


• Black

Question: What will be the colour of pixel [255, 255, 255]?


• White
Now, Let's separate the R, G, B channels in our image:
• We'll make use of slicing of arrays

• For RED channel, we'll set values of GREEN and BLUE to 0

img = np.array(plt.imread('fruits.png'))

img_R = img.copy()

img_R[:, :, (1, 2)] = 0

plt.imshow(img_R)

<matplotlib.image.AxesImage at 0x7fcc3df86670>

Similarly, for GREEN channel, we'll set values of RED and BLUE to 0

... and same for BLUE channel
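The GREEN and BLUE versions follow the same pattern as the RED one: copy the image, then zero out the other two channels. The sketch below uses a hypothetical 2 × 2 uint8 image in place of fruits.png so that it runs without the file; with the real image you would pass img_G or img_B to plt.imshow().

```python
import numpy as np

# Hypothetical 2x2 RGB image standing in for the fruits image
img = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

img_G = img.copy()
img_G[:, :, [0, 2]] = 0   # zero out RED and BLUE, keep GREEN

img_B = img.copy()
img_B[:, :, [0, 1]] = 0   # zero out RED and GREEN, keep BLUE

print(img_G[1, 1])  # the white pixel keeps only its green component
print(img_B[1, 1])  # the white pixel keeps only its blue component
```

Working on a copy keeps the original img intact, which matters because each channel view starts from the same source array.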

Rotating an Image (Transpose the Numpy Array)


Now, What if we want to rotate the image?
• Remember image is a Numpy array

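• Here, rotating the image means transposing the array, that is, swapping its rows and columns (strictly speaking, a transpose also mirrors the image)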

For this, we'll use the np.transpose() function in numpy

Now, Let's understand np.transpose() function first


• It takes 2 arguments
1st argument is obviously the array that we want to transpose (image array in our case)

2nd argument is axes

• It's a tuple or list of ints

• It contains a permutation of [0,1,..,N-1] where N is the number of axes of array

Now, our image array has 3 axes (3 dimensions) ---> 0th, 1st and 2nd
• We specify how we want to transpose the array by giving an order of these axes
inside the tuple
– Vertical axis (Row axis) is 0th axis
– Horizontal axis (Column axis) is 1st axis
– Depth axis is 2nd axis
• In order to rotate the image, we want to transpose the array

• That is, we want to transpose rows into columns and columns into rows

• So, we want to interchange the order of row and column axis ---> interchange
order of 0th and 1st axis

• We don't want to change the depth axis (2nd axis) ---> So, it will remain at its
original order position
Now, the order of axes in the original image is (0, 1, 2)

What will be the order of axes in the rotated image or transposed array?


• The order of axes in the rotated image will be (1, 0, 2)

• The order (position) of the 0th and 1st axes is interchanged

Let's see it in action:


img = np.array(plt.imread('emma_stone.jpeg'))
img_rotated = np.transpose(img, (1,0,2))
plt.imshow(img_rotated)

<matplotlib.image.AxesImage at 0x7fcc3de81d00>
As you can see:
• We obtained the rotated image by transposing the np array

Trim Image
Now, How can we crop an image using Numpy?
• Remember! Image is a numpy array of pixels

• So, we can trim/crop an image in Numpy using array slicing.

Let's first see the original image


img = np.array(plt.imread('./emma_stone.jpeg'))

plt.imshow(img)

<matplotlib.image.AxesImage at 0x7fcc3de13190>
Now, Let's crop the image to get the face only
• If you see x and y axis, the face starts somewhat from ~200 and ends at ~700 on x-axis
– x-axis in image is column axis in np array
– Columns change along x-axis
• And it lies between ~100 to ~500 on y-axis
– y-axis in image is row axis in np array
– Rows change along y-axis

We'll use this information to slice our image array


img_crop = img[100:500, 200:700, :]
plt.imshow(img_crop)

<matplotlib.image.AxesImage at 0x7fcc3ddf9a60>
Saving ndarray as Image
Now, How can we save ndarray as Image?
To save a ndarray as an image, we can use matplotlib's plt.imsave() method.

• 1st argument ---> We provide the path and name of the file we want to save the image
as

• 2nd argument ---> We provide the image we want to save

Let's save the cropped face image we obtained previously


path = 'emma_face.jpg'
plt.imsave(path, img_crop)

Now, if you go and check your current working directory, image would have been
saved by the name emma_face.jpg

Array Splitting and Merging


• In addition to reshaping and selecting subarrays, it is often necessary to split arrays
into smaller arrays or merge arrays into bigger arrays,

• For example, when joining separately computed or measured data series into a
higher-dimensional array, such as a matrix.
Splitting
np.split()
• Splits an array into multiple sub-arrays as views

It takes an argument indices_or_sections


• If indices_or_sections is an integer, n, the array will be divided into n equal
arrays along axis.

• If such a split is not possible, an error is raised.

• If indices_or_sections is a 1-D array of sorted integers, the entries indicate
where along axis the array is split.

• If an index exceeds the dimension of the array along axis, an empty sub-array is
returned correspondingly.

x = np.arange(9)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

np.split(x, 3)

[array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])]

np.split(x, [3, 5, 6])

[array([0, 1, 2]), array([3, 4]), array([5]), array([6, 7, 8])]
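
The two edge cases listed above (an impossible equal split, and an index past the end of the array) can be sketched as follows. The `split_failed` flag is just an illustrative name.

```python
import numpy as np

x = np.arange(9)

# 9 elements cannot be divided into 4 equal parts, so np.split raises ValueError
try:
    np.split(x, 4)
    split_failed = False
except ValueError:
    split_failed = True
print(split_failed)  # True

# An index past the end of the array yields an empty sub-array
parts = np.split(x, [3, 12])
print([p.tolist() for p in parts])  # [[0, 1, 2], [3, 4, 5, 6, 7, 8], []]
```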

np.hsplit()
• Splits an array into multiple sub-arrays horizontally (column-wise).
x = np.arange(16.0).reshape(4, 4)
x

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.],
       [12., 13., 14., 15.]])
Think of it this way:
• There are 2 axis to a 2-D array
a. 1st axis - Vertical axis
b. 2nd axis - Horizontal axis

Along which axis are we splitting the array?


• The split we want happens across the 2nd axis (Horizontal axis)

• That is why we use hsplit()

So, try to think in terms of "whether the operation is happening along vertical axis or
horizontal axis"
• We are splitting the horizontal axis in this case
np.hsplit(x, 2)

[array([[ 0.,  1.],
        [ 4.,  5.],
        [ 8.,  9.],
        [12., 13.]]),
 array([[ 2.,  3.],
        [ 6.,  7.],
        [10., 11.],
        [14., 15.]])]

np.hsplit(x, np.array([3, 6]))

[array([[ 0.,  1.,  2.],
        [ 4.,  5.,  6.],
        [ 8.,  9., 10.],
        [12., 13., 14.]]),
 array([[ 3.],
        [ 7.],
        [11.],
        [15.]]),
 array([], shape=(4, 0), dtype=float64)]
np.vsplit()
• Splits an array into multiple sub-arrays vertically (row-wise).
x = np.arange(16.0).reshape(4, 4)
x

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.],
       [12., 13., 14., 15.]])

Now, along which axis are we splitting the array?


• The split we want happens across the 1st axis (Vertical axis)

• That is why we use vsplit()

Again, always try to think in terms of "whether the operation is happening along
vertical axis or horizontal axis"
• We are splitting the vertical axis in this case
np.vsplit(x, 2)

[array([[0., 1., 2., 3.],
        [4., 5., 6., 7.]]),
 array([[ 8.,  9., 10., 11.],
        [12., 13., 14., 15.]])]

np.vsplit(x, np.array([3]))

[array([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]]),
 array([[12., 13., 14., 15.]])]
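
For a 2-D array, hsplit() and vsplit() behave like np.split() with the axis fixed. A small sketch to confirm this (the variable names are illustrative):

```python
import numpy as np

x = np.arange(16.0).reshape(4, 4)

h_parts = np.hsplit(x, 2)             # split along the horizontal axis
v_parts = np.vsplit(x, 2)             # split along the vertical axis
h_via_split = np.split(x, 2, axis=1)  # same thing, axis given explicitly
v_via_split = np.split(x, 2, axis=0)

same_h = all(np.array_equal(p, q) for p, q in zip(h_parts, h_via_split))
same_v = all(np.array_equal(p, q) for p, q in zip(v_parts, v_via_split))
print(same_h, same_v)                      # True True
print(h_parts[0].shape, v_parts[0].shape)  # (4, 2) (2, 4)
```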
Stacking
Let's say we have an array and we want to stack it like this:

Will we use vstack() or hstack()?

Along which axis the operation is happening?


• Vertical axis

• So, we'll use vstack()


np.vstack()
• Stacks a list of arrays vertically (along axis 0 or 1st axis)

• For example, given a list of row vectors, appends the rows to form a matrix.

data = np.arange(5)
data

array([0, 1, 2, 3, 4])

np.vstack((data, data, data))

array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])
Now, What if we want to stack the array like this?

• Operation or change is happening along horizontal axis

• So, we'll use hstack()


np.hstack()
• Stacks a list of arrays horizontally (along axis 1)

• For example, given a list of column vectors, appends the columns to form a
matrix.

data = np.arange(5).reshape(5,1)
data

array([[0],
[1],
[2],
[3],
[4]])

np.hstack((data, data, data))

array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2],
[3, 3, 3],
[4, 4, 4]])
Question: Now, What will be the output of this?
a = np.array([[1], [2], [3]])
b = np.array([[4], [5], [6]])
np.hstack((a, b))

a = np.array([[1], [2], [3]])


a

array([[1],
[2],
[3]])

b = np.array([[4], [5], [6]])


b

array([[4],
[5],
[6]])

np.hstack((a, b))

array([[1, 4],
[2, 5],
[3, 6]])

This time both a and b are column vectors


• So, the stacking of a and b along horizontal axis is more clearly visible

Now, Let's look at a more generalized way of stacking arrays


np.concatenate()
• Creates a new array by appending arrays after each other, along a given axis

• Provides similar functionality, but it takes a keyword argument axis that specifies
the axis along which the arrays are to be concatenated.

All input arrays passed to concatenate() must have the same number of dimensions,
and their shapes must match along every axis except the one being joined
z = np.array([[2, 4]])
z

array([[2, 4]])

z.ndim

2

zz = np.concatenate([z, z], axis=0)


zz
array([[2, 4],
[2, 4]])

zz = np.concatenate([z, z], axis=1)


zz

array([[2, 4, 2, 4]])

Let's look at a few more examples using np.concatenate()

Question: What will be the output of this?


a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
np.concatenate((a, b), axis=0)

a = np.array([[1, 2], [3, 4]])


a

array([[1, 2],
[3, 4]])

b = np.array([[5, 6]])
b

array([[5, 6]])

np.concatenate((a, b), axis=0)

array([[1, 2],
[3, 4],
[5, 6]])

Now, How did it work?


• Dimensions of a is 2 ×2

What are the dimensions of b?


• 1-D array ?? - NO

• Look carefully!!

• b is a 2-D array of dimensions 1 ×2

axis = 0 ---> It's a vertical axis


• So, changes will happen along vertical axis

• So, b gets concatenated below a


Now, what if we pass axis=None instead of an axis number?
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
np.concatenate((a, b), axis=None)

array([1, 2, 3, 4, 5, 6])

Can you see what happened here?


• When we pass axis=None, np.concatenate() flattens the arrays and concatenates
them into a single 1-D array. (If the axis argument is left out entirely, it defaults to axis=0.)
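
The relationship between the stacking helpers and np.concatenate() can be sketched as follows. The array `c` is an extra column introduced only for this illustration. Note that the default axis of np.concatenate() is 0, and flattening happens only when axis=None is passed explicitly.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

# Leaving out the axis argument is the same as axis=0, i.e. vstack for 2-D arrays
default_result = np.concatenate((a, b))
print(np.array_equal(default_result, np.concatenate((a, b), axis=0)))  # True
print(np.array_equal(default_result, np.vstack((a, b))))               # True

# hstack corresponds to axis=1 for 2-D arrays
c = np.array([[7], [8]])  # hypothetical extra column, shape (2, 1)
print(np.array_equal(np.hstack((a, c)), np.concatenate((a, c), axis=1)))  # True

# axis=None flattens everything first
flat = np.concatenate((a, b), axis=None)
print(flat.shape)  # (6,)
```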

Broadcasting

Case1:
You are given two 2D arrays

[[0, 0, 0],            [[0, 1, 2],
 [10, 10, 10],    and   [0, 1, 2],
 [20, 20, 20],          [0, 1, 2],
 [30, 30, 30]]          [0, 1, 2]]

Shape of first array is 4x3

Shape of second array is 4x3.

Will addition of these arrays be possible? Yes, as the shapes of the two arrays match.

a = np.tile(np.arange(0,40,10), (3,1))
a

array([[ 0, 10, 20, 30],
       [ 0, 10, 20, 30],
       [ 0, 10, 20, 30]])

np.tile function is used to repeat the given array multiple times

np.tile(np.arange(0,40,10), (3,2))

array([[ 0, 10, 20, 30,  0, 10, 20, 30],
       [ 0, 10, 20, 30,  0, 10, 20, 30],
       [ 0, 10, 20, 30,  0, 10, 20, 30]])

Now, let's get back to example:

array([[ 0, 10, 20, 30],
       [ 0, 10, 20, 30],
       [ 0, 10, 20, 30]])

a = a.T

array([[ 0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]])

b = np.tile(np.arange(0,3), (4,1))

array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])

Let's add these two arrays:

a + b

array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
Textbook case of element-wise addition of two 2D arrays.

Case2 :
Imagine an array like this:

[[0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]]

I want to add the following array to it:

[[0, 1, 2]]

Is it possible? Yes!

What broadcasting does is (conceptually) replicate the second array row-wise 4 times to match
the shape of the first array.

Here both arrays have the same number of columns

array([[ 0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]])

b = np.arange(0,3)
b

array([0, 1, 2])

a + b

array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
The smaller array is broadcast across the larger array so that they have compatible shapes.

Case 3:
Imagine I have two arrays like this:

[[0],
[10],
[20],
[30]]

and

[[0, 1, 2]]

i.e. one column matrix and one row matrix.

When we try to add these arrays, broadcasting will replicate the first array column-wise 3 times and
the second array row-wise 4 times so that the shapes match.

a = np.arange(0,40,10)
a

array([ 0, 10, 20, 30])

This is a 1D array, but we want it as a column. How do we do it? Reshape!

a = a.reshape(4,1)
a

array([[ 0],
[10],
[20],
[30]])

b = np.arange(0,3)
b

array([0, 1, 2])

a + b
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
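
The "replication" in Case 3 can be made visible with np.broadcast_to(). This is a minimal sketch. The stretched arrays are read-only views, and the stride of 0 along the stretched axis suggests that no data is actually duplicated.

```python
import numpy as np

a = np.arange(0, 40, 10).reshape(4, 1)  # column, shape (4, 1)
b = np.arange(0, 3)                     # row, shape (3,)

# Stretch both arrays to the common shape (4, 3)
a_stretched = np.broadcast_to(a, (4, 3))
b_stretched = np.broadcast_to(b, (4, 3))
print(a_stretched)
print(b_stretched)

# Adding the stretched arrays gives the same result as a + b
print(np.array_equal(a_stretched + b_stretched, a + b))  # True

# The stretched axis has stride 0: the same values are re-read, not copied
print(a_stretched.strides[1], b_stretched.strides[0])    # 0 0
```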

Question: (for general broadcasting rules)


What will be the output of the following?

a = np.arange(8).reshape(2,4)
b = np.arange(16).reshape(4,4)

a + b

a = np.arange(8).reshape(2,4)
a

array([[0, 1, 2, 3],
[4, 5, 6, 7]])

b = np.arange(16).reshape(4,4)
b

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])

a + b

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-306-bd58363a63fc> in <module>
----> 1 a + b

ValueError: operands could not be broadcast together with shapes (2,4) (4,4)

Why didn't it work?


To understand this, let's learn about some General Broadcasting Rules

For each dimension (going from the right side), the two sizes are compatible when:


1. The sizes are the same, OR
2. One of the sizes is 1
Rule 1 : If two arrays differ in the number of dimensions, the shape of the one with fewer dimensions is
padded with ones on its leading (left) side.

Rule 2 : If the shapes of the two arrays do not match in some dimension, the array whose size is 1 in that
dimension is stretched to match the other shape.

Rule 3 : If in any dimension the sizes disagree and neither is equal to 1, an error is raised.

In the above example, the shapes were (2,4) and (4,4).

Let's compare the dimension from right to left

• First, it will compare the rightmost dimension (4 and 4), which are equal.

• Next, it will compare the leftmost dimension, i.e. 2 and 4.


– Both conditions fail here. The sizes are neither equal, nor is one of them 1.

Hence, it threw an error while broadcasting.
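
The three rules above can be written down as a small function. This is only an illustrative sketch: `broadcast_shape` is a hypothetical helper written for these notes, not a Numpy function. Its result is cross-checked against what Numpy itself computes.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Apply the broadcasting rules to two shapes (hypothetical helper)."""
    ndim = max(len(shape_a), len(shape_b))
    # Rule 1: pad the shorter shape with ones on its left side
    padded_a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    padded_b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for size_a, size_b in zip(padded_a, padded_b):
        if size_a == size_b or size_b == 1:
            result.append(size_a)  # sizes match, or b is stretched (Rule 2)
        elif size_a == 1:
            result.append(size_b)  # a is stretched (Rule 2)
        else:
            # Rule 3: sizes disagree and neither is 1
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(result)

print(broadcast_shape((3, 3), (3,)))  # (3, 3)
print(broadcast_shape((4, 1), (3,)))  # (4, 3)

# Cross-check against Numpy
print(np.broadcast(np.empty((4, 1)), np.empty((3,))).shape)  # (4, 3)

try:
    broadcast_shape((2, 4), (4, 4))
    rule3_triggered = False
except ValueError:
    rule3_triggered = True
print(rule3_triggered)  # True
```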

Now, Let's take a look at few more examples

Question : Will broadcasting work in this case ?


A = np.arange(1,10).reshape(3,3)
B = np.array([-1, 0, 1])
A * B

A = np.arange(1,10).reshape(3,3)
A

array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

B = np.array([-1, 0, 1])
B

array([-1, 0, 1])

A * B

array([[-1, 0, 3],
[-4, 0, 6],
[-7, 0, 9]])

Why did A * B work in this case?


• A has 3 rows and 3 columns i.e. (3,3)

• B is a 1-D vector with 3 elements (3,)

Now, if you look at rule 1


Rule 1 : If two arrays differ in the number of dimensions,
the shape of the one with fewer dimensions is padded with ones on its
leading (left) side.

What is the shape of A and B ?


• A has a shape of (3,3)
• B has a shape of (3,)

As per the rule 1,

• the shape of the array with fewer dimensions will be prefixed with ones on its leading side.

Here, shape of B will be prefixed with 1

• So, its shape will become (1,3)

Can we multiply a (3,3) and a (1,3) array?


We check the validity of broadcasting. i.e. if broadcasting is possible or not.

Checking the dimension from right to left.

• It will compare the right most dimension (3); which are equal
• Now, it compares the leading dimension.
– The size of one dimension is 1.

Hence, broadcasting condition is satisfied

How will it broadcast?


As per rule 2:

Rule 2 :
If the shapes of the two arrays do not match in some dimension,
the array whose size is 1 in that dimension is stretched to match the other shape.

Here, array B (1,3) will replicate/stretch its row 3 times to match the shape of A

So , B gets broadcasted over A for each row of A

Question: Will broadcasting work in following case ?


A = np.arange(1,10).reshape(3,3)
B = np.arange(3, 10, 3).reshape(3,1)
C = A + B

A = np.arange(1,10).reshape(3,3)
A
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

B = np.arange(3, 10, 3).reshape(3,1)


B

array([[3],
[6],
[9]])

How did this A + B work?


• A has 3 rows and 3 columns i.e. shape (3,3)

• B has 3 rows and 1 column -i.e. shape (3,1)

Do we need to check rule 1 ?


Since, both arrays have same number of dimensions, we can ignore Rule 1.

Let's check whether broadcasting is possible or not


Now, for each dimension from right to left

• Rightmost dimension: the sizes are 3 and 1, so one of them is 1.


• Leading dimension: both sizes are 3, so they match.

So, conditions for broadcasting are met.

How will broadcasting happen?


As per rule 2, the dimension with size 1 will be stretched.

• A.shape => (3,3)


• B.shape => (3,1)

Hence, the single column of B will be replicated/stretched 3 times to match the shape of A.

• So, B gets broadcast across every column of A


C = A + B
np.round(C, 1)

array([[ 4, 5, 6],
[10, 11, 12],
[16, 17, 18]])

Dimension Expansion and Reduction


Recall that we learnt how to convert 1D array to 2D array in previous lectures
import numpy as np

arr = np.arange(6)
arr

array([0, 1, 2, 3, 4, 5])

arr.shape

(6,)

arr = arr.reshape(1,-1)

arr.shape

(1, 6)

This is also known as expanding dimensions

i.e. we expanded our dimension from 1D to 2D

We can also perform the same operation using np.expand_dims()

np.expand_dims()
• Expands the shape of an array by inserting an axis of length 1.
• The new axis appears at the given axis position in the expanded array shape.

Function signature: np.expand_dims(arr, axis)

Documentation:
https://numpy.org/doc/stable/reference/generated/numpy.expand_dims.html#numpy.expand_dims

arr

array([[0, 1, 2, 3, 4, 5]])

Let's check the shape of arr

arr.shape

(1, 6)

Let's expand the dimensions

arr1 = np.expand_dims(arr, axis = 0 )


arr1

array([[[0, 1, 2, 3, 4, 5]]])

arr1.shape
(1, 1, 6)

What happened here?

Here, the shape of arr is (1, 6)

• We have two axes, i.e. axis = 0 and axis = 1.

When we expand dimension with axis = 0,

• it inserts a new axis of length 1 at position 0


• Shape becomes (1, 1, 6) from (1, 6)
• i.e. 1 is inserted at the given axis location

Let's expand dims @ axis = 1

arr2 = np.expand_dims(arr, axis = 1)


arr2

array([[[0, 1, 2, 3, 4, 5]]])

arr2.shape

(1, 1, 6)

Notice that,

• as we provided axis = 1 in the argument,


• it inserted a new axis of length 1 at position 1.
• Hence, shape became (1, 1, 6) from (1, 6). For a 1D array of shape (6,), axis = 1 would give (6, 1) instead.

We can also do same thing using np.newaxis

np.newaxis
• It is passed as an index while indexing the array.

Let's see how it works

arr = np.arange(6)

arr[np.newaxis, :] #equivalent to np.expand_dims(arr, axis =0)


array([[0, 1, 2, 3, 4, 5]])

We basically passed np.newaxis at the axis position where we want to add an axis

• In arr[np.newaxis, : ],
– we passed it @ axis =0, hence shape 1 was added @ axis = 0
– and therefore, shape became (1, 6)
arr[:, np.newaxis] # equivalent to np.expand_dims(arr, axis = 1 )

array([[0],
[1],
[2],
[3],
[4],
[5]])
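
To tie the three approaches together, here is a short sketch showing that np.newaxis, np.expand_dims() and reshape() produce the same shapes for a 1D array (the variable names are illustrative):

```python
import numpy as np

arr = np.arange(6)

# Three ways to turn shape (6,) into a row of shape (1, 6)
row_a = arr[np.newaxis, :]
row_b = np.expand_dims(arr, axis=0)
row_c = arr.reshape(1, -1)
print(row_a.shape, row_b.shape, row_c.shape)  # (1, 6) (1, 6) (1, 6)

# Three ways to turn shape (6,) into a column of shape (6, 1)
col_a = arr[:, np.newaxis]
col_b = np.expand_dims(arr, axis=1)
col_c = arr.reshape(-1, 1)
print(col_a.shape, col_b.shape, col_c.shape)  # (6, 1) (6, 1) (6, 1)

# np.newaxis is simply an alias for None
print(np.newaxis is None)  # True
```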

What if we want to reduce the number of dimensions?


We can use np.squeeze for reducing the dimensions

np.squeeze()
• It removes the axes of length 1 from an array.
• Inverse of expand_dims

Function signature: np.squeeze(arr, axis)

Documentation: https://numpy.org/doc/stable/reference/generated/numpy.squeeze.html

arr = np.arange(9).reshape(1,1,9)
arr

array([[[0, 1, 2, 3, 4, 5, 6, 7, 8]]])

arr.shape

(1, 1, 9)

arr1 = np.squeeze(arr)
arr1

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

arr1.shape

(9,)

Notice that

• it reduced the shape from (1,1,9) to (9,)


• it did so by removing the axis of length 1
• i.e. it removed axis 0 and 1.

We can also remove specific axis using the axis argument

arr

array([[[0, 1, 2, 3, 4, 5, 6, 7, 8]]])

arr.shape

(1, 1, 9)

Let's remove axis = 1

arr1 = np.squeeze(arr, axis = 1 )


arr1

array([[0, 1, 2, 3, 4, 5, 6, 7, 8]])

arr1.shape

(1, 9)

What if we try to remove axis = 2?


np.squeeze(arr, axis = 2 )

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-335-26d8de107e93> in <module>
----> 1 np.squeeze(arr, axis = 2 )

/usr/local/lib/python3.9/dist-packages/numpy/core/overrides.py in squeeze(*args, **kwargs)

/usr/local/lib/python3.9/dist-packages/numpy/core/fromnumeric.py in squeeze(a, axis)
   1543         return squeeze()
   1544     else:
-> 1545         return squeeze(axis=axis)
   1546
   1547

ValueError: cannot select an axis to squeeze out which has size not equal to one

It throws an error

• as we are trying to remove an axis whose length is not 1
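
The behaviour above can be checked defensively in code. A minimal sketch (`squeeze_failed` is just an illustrative flag). It also shows that the squeezed result shares memory with the original array.

```python
import numpy as np

arr = np.arange(9).reshape(1, 1, 9)

# Several length-1 axes can be removed at once by passing a tuple
print(np.squeeze(arr, axis=(0, 1)).shape)  # (9,)

# The squeezed array is a view on the same data
squeezed = np.squeeze(arr, axis=0)
print(squeezed.shape)                      # (1, 9)
print(np.shares_memory(arr, squeezed))     # True

# Axis 2 has length 9, so it cannot be squeezed out
try:
    np.squeeze(arr, axis=2)
    squeeze_failed = False
except ValueError:
    squeeze_failed = True
print(squeeze_failed)  # True
```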

Views vs Copies (Shallow vs Deep Copy)


• Numpy manages memory very efficiently

• Which makes it really useful while dealing with large datasets

But how does it manage memory so efficiently?


• Let's create some arrays to understand what's happening in memory while using Numpy
# We'll create np array

a = np.arange(4)
a

array([0, 1, 2, 3])

# Reshape array `a` and store in b

b = a.reshape(2, 2)
b

array([[0, 1],
[2, 3]])

Now we will make some changes to our original array a


a[0] = 100
a

array([100, 1, 2, 3])

What will be values if we print array b ?


b

array([[100, 1],
[ 2, 3]])

Surprise Surprise!!
• Array b got automatically updated

This is an example of Numpy using "Shallow Copy" of data


Now, What happens here?
• Numpy re-uses data as much as possible instead of duplicating it
• This helps Numpy to be efficient

When we created b = a.reshape(2, 2)


• Numpy did NOT make a copy of a to store in b, as we can clearly see

• It is using the same data as in a

• It just looks different (reshaped) in b

• That is why, any changes in a automatically gets reflected in b

How data is stored using Numpy?


• Variable does NOT directly point to data stored in memory

• There is something called Header in-between

What does Header do?


• Variable points to header and header points to data stored in memory

• Header stores information about data - called Metadata

a is pointing to Metadata about our data [0, 1, 2, 3], which may include:
• How many values we have --> 4

• What is the Data Type of data --> int

• What's the Shape --> (4,)

• What's the stride i.e. step size --> 1 element (Numpy itself reports strides in bytes)

When we do b = a.reshape(2, 2)
• Numpy does NOT duplicate the data pointed to by a

• It uses the same data

• And creates a new header for b that points to the same data as pointed to by a

b points to a new Header having different values of Metadata of the same data:
• Number of values --> 4

• Data Type --> int

• Shape --> (2, 2)

• Strides i.e. step sizes --> (2, 1) elements: moving to the next row skips 2 values, moving to the next column skips 1

That is why:
• When data is accessed using a, it gives data in shape (4,)

• And when data is accessed using b, it gives same data in shape (2, 2)
This helps Numpy to save time and space - Making it efficient
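
The header information described above can be inspected directly. This is a small sketch. Numpy reports strides in bytes, so the check below is written in terms of `itemsize` rather than assuming a particular integer width.

```python
import numpy as np

a = np.arange(4)
b = a.reshape(2, 2)

# Same data, different headers
print(a.shape, b.shape)      # (4,) (2, 2)
print(a.strides, b.strides)  # in bytes, e.g. (8,) and (16, 8) for 8-byte integers

itemsize = a.itemsize        # number of bytes per element
print(a.strides == (itemsize,))               # True
print(b.strides == (2 * itemsize, itemsize))  # True

# b does not own its data: it is a view whose base is a
print(b.base is a)              # True
print(np.shares_memory(a, b))   # True
```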

[Figure: Numpy metadata internals]
Now, let's see an example where Numpy will create a "Deep Copy" of data
Now, what if we do this?

a = np.arange(4)
a

array([0, 1, 2, 3])

# Create `c`

c = a + 2
c

array([2, 3, 4, 5])

# We make changes in a

a[0] = 100
a

array([100, 1, 2, 3])

c

array([2, 3, 4, 5])

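As we can see, c did not get affected on changing a


• Because a + 2 is an arithmetic operation that produces new values

• The result cannot be described as just a different header over the data of a

• So, Numpy had to allocate separate memory for c - i.e., c holds its own data and
does not share memory with a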

Conclusion:
• Numpy is able to reuse the same data for operations that only change how the data is
viewed, like reshape (which returns a view whenever possible) ---> Shallow Copy

• It creates a new copy of data for operations that compute new values
---> Deep Copy
Be careful about this while writing code using Numpy

Is there a way to check whether two arrays are sharing memory or not? Yes, the
np.shares_memory() function comes to the rescue!!

a= np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

b = a[::2]
b

array([0, 2, 4, 6, 8])

np.shares_memory(a,b)

True

Notice that Slicing creates shallow copies.

Why does slicing create shallow copies ?


Remember the stride parameter of the header.

• Stride is nothing but the step size.

For Array a, we have a stride of 1.

For creating array b,

• we are slicing array a by 2 i.e. stride 2.


• So, it creates a new header for array b with stride = 2 while pointing to the original data
b[0] = 2
b

array([2, 2, 4, 6, 8])

a

array([2, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Notice how change in b also changed the value in array a


Let's check with deep copy

a

array([2, 1, 2, 3, 4, 5, 6, 7, 8, 9])

b = a +2
np.shares_memory(a,b)

False

We learnt how .reshape and slicing return a view of the original array

• i.e. any changes made in the original array will be reflected in the new array, and vice versa.

However, creating a new array using

• masking or an array operation returns a deep copy of the array.


• Any changes made in the new array are not reflected in the original array.
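
The difference between slicing and masking can be verified with np.shares_memory(). A minimal sketch (the variable names are illustrative):

```python
import numpy as np

a = np.arange(10)

sliced = a[2:6]      # basic slicing -> view
masked = a[a > 5]    # boolean masking -> copy
fancy = a[[0, 1, 2]] # indexing with a list of positions -> copy

print(np.shares_memory(a, sliced))  # True
print(np.shares_memory(a, masked))  # False
print(np.shares_memory(a, fancy))   # False

# Changing the masked result leaves the original untouched
masked[0] = -1
print(a[6])  # 6
```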

Numpy also provides us with few functions to make shallow/ deep copy

How to make shallow copy?


Numpy provides us with the .view() method, which returns a view of an array

.view()

Returns view of the original array

• Any changes made in new array will be reflected in original array.

Function documentation:
https://numpy.org/doc/stable/reference/generated/numpy.ndarray.view.html

arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

view_arr = arr.view()
view_arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

view_arr[4] = 420
view_arr

array([ 0, 1, 2, 3, 420, 5, 6, 7, 8, 9])

arr

array([ 0, 1, 2, 3, 420, 5, 6, 7, 8, 9])

Notice that changes in view array are reflected in original array.


How do we make deep copy ?
Numpy has the .copy() method for that purpose

.copy()

Returns copy of the array.

Documentation (.copy()):
https://numpy.org/doc/stable/reference/generated/numpy.ndarray.copy.html#numpy.ndarray.copy

Documentation (np.copy()):
https://numpy.org/doc/stable/reference/generated/numpy.copy.html

arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

copy_arr = arr.copy()
copy_arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Let's modify the content of copy_arr and check whether it modified the original array as well

copy_arr[3] = 45
copy_arr

array([ 0, 1, 2, 45, 4, 5, 6, 7, 8, 9])

arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Notice that

• The content of original array were not modified as we changed our copy array.

What are object arrays ?


Object arrays are basically arrays that can hold any Python datatype.

Documentation: https://numpy.org/devdocs/reference/arrays.scalars.html#numpy.object_

arr = np.array([1, 'm', [1,2,3]], dtype = 'object')


arr

array([1, 'm', list([1, 2, 3])], dtype=object)


But arrays are supposed to hold homogeneous data. How is it storing data of various
types?
Remember that everything is an object in Python.

Just like python list,

• The data actually stored in object arrays are references to Python objects, not the
objects themselves.

Hence, their elements need not be of the same Python type.

As every element in the array is a reference to a Python object, the dtype is object.

Let's make a copy of object array and check whether it returns a shallow copy or deep copy.

copy_arr = arr.copy()

copy_arr

array([1, 'm', list([1, 2, 3])], dtype=object)

Now, let's try to modify the list elements in copy_arr

copy_arr[2][0] = 999

copy_arr

array([1, 'm', list([999, 2, 3])], dtype=object)

Let's see if it changed the original array as well

arr

array([1, 'm', list([999, 2, 3])], dtype=object)

It did change the original array.

Hence, .copy() returns only a shallow copy for object arrays: the references are copied, not the objects they point to.

Any change made inside a nested (2nd level) object will be reflected in the original array as well.

So, how do we create deep copy then ?


We can do so using the copy.deepcopy() function from Python's copy module
copy.deepcopy()

Returns a deep copy of the array

Documentation: https://docs.python.org/3/library/copy.html#copy.deepcopy

import copy

arr = np.array([1, 'm', [1,2,3]], dtype = 'object')


arr

array([1, 'm', list([1, 2, 3])], dtype=object)

Let's make a copy using deepcopy()

deep_copy_arr = copy.deepcopy(arr)

deep_copy_arr

array([1, 'm', list([1, 2, 3])], dtype=object)

Let's modify the list inside the copied array

deep_copy_arr[2][0] = 999

deep_copy_arr

array([1, 'm', list([999, 2, 3])], dtype=object)

arr

array([1, 'm', list([1, 2, 3])], dtype=object)

Notice that,

• the changes in copy array didn't reflect back to original array.

copy.deepcopy() returns deep copy of an array.
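
The difference between the two kinds of copy for object arrays comes down to object identity. Here is a small sketch using `is` to check whether the nested list is the very same Python object (`shallow` and `deep` are illustrative names):

```python
import copy
import numpy as np

arr = np.array([1, 'm', [1, 2, 3]], dtype=object)

shallow = arr.copy()        # copies the references only
deep = copy.deepcopy(arr)   # copies the referenced objects as well

# The nested list is shared by the shallow copy, but not by the deep copy
print(shallow[2] is arr[2])  # True
print(deep[2] is arr[2])     # False

# Replacing a top-level element of the shallow copy does not touch arr
shallow[0] = 50
print(arr[0])                # 1

# Mutating the nested list through the deep copy does not touch arr either
deep[2][0] = 999
print(arr[2])                # [1, 2, 3]
```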

Summarizing
• .view() returns a shallow copy (view) of an array
• .copy() returns a deep copy of an array, except for object arrays, where only the references are copied
• copy.deepcopy() returns a deep copy of an array, including object arrays.
