Content
• Installing and Importing Numpy
• Introduction to use case
• Motivation: Why to use Numpy? - How is it different from Python Lists?
• Creating a Basic Numpy Array
– From a List - array(), shape, ndim
– From a range and stepsize - arange()
– type() ndarray
• How numpy works under the hood?
• Indexing and Slicing on 1D
– Indexing
– Slicing
– Masking (Fancy Indexing)
• Operation on array
• Universal Functions (ufunc) on 1D array
– Aggregate Function/ Reduction functions - sum(), mean(), min(), max()
• Usecase: calculate NPS
– loading data: np.loadtxt()
– np.empty()
– np.unique()
• Reshape with -ve index
• Matrix Multiplication
– matmul(), @, dot()
• Vectorization
– np.vectorize()
• 3D arrays
• Use Case: Image Manipulation using Numpy
– Opening an Image
– Details of an image
– Visualizing Channels
– Rotating an Image (Transposing a Numpy Array)
– Trim image
– Saving ndarray as Image
• 2-D arrays (Matrices)
– reshape()
– 2 Questions
– Transpose
– Converting Matrix back to Vector - flatten()
• Indexing and Slicing on 2D
– Indexing
– Slicing
– Masking (Fancy Indexing)
• Universal Functions (ufunc) on 2D
– Aggregate Function/ Reduction functions - sum(), mean(), min(), max()
– Axis argument
– Logical Operations
– Sorting function - sort(), argsort()
• Use Case: Fitness Data analysis
– Loading data set and EDA using numpy
– np.argmax()
• Array splitting and Merging
– Splitting arrays - split(), hsplit(), vsplit()
– Merging Arrays - hstack(), vstack(), concatenate()
• Broadcasting
– np.tile()
• Dimension Expansion and Reduction
– np.expand_dims()
– np.newaxis
– np.squeeze()
• Shallow vs Deep Copy
– view()
– copy()
– copy.deepcopy()
Installation Using %pip
!pip install numpy
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: numpy in /usr/local/lib/python3.9/dist-packages (1.22.4)
Importing Numpy
• We'll import numpy under the alias np for ease of typing
import numpy as np
Use Case: NPS (Net Promoter Score)
Imagine you are a Data Analyst @ Airbnb
You've been asked to analyze user survey data and report NPS to the management
But, what exactly is NPS?
Have you seen something like this?
Link: https://drive.google.com/file/d/1-u8e-v_90JdikorKsKzBM-JJqoRtzsN8/view?usp=sharing
This is called a Likelihood to Recommend survey
• Responses are given a scale ranging from 0–10,
– with 0 labeled with “Not at all likely,” and
– 10 labeled with “Extremely likely.”
Based on this, we calculate the Net Promoter score
How to calculate NPS score?
We label our responses into 3 categories:
• Detractors: Respondents with a score of 0-6
• Passive: Respondents with a score of 7-8
• Promoters: score of 9-10.
And
Net Promoter score = % Promoters - % Detractors.
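As a quick worked example of this formula (the counts below are made up for illustration and are not taken from the survey file):

```python
# Hypothetical survey of 100 respondents (illustrative numbers only)
promoters = 50    # scores 9-10
passives = 30     # scores 7-8
detractors = 20   # scores 0-6
total = promoters + passives + detractors

percent_promoters = promoters / total * 100
percent_detractors = detractors / total * 100

# Passives count towards the total but do not appear in the formula itself
nps = percent_promoters - percent_detractors
print(nps)  # 30.0
```

Since both percentages lie between 0 and 100, NPS always falls between -100 (everyone is a detractor) and +100 (everyone is a promoter).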
How is NPS helpful?
Why would we want to analyse the survey data for NPS?
NPS helps a brand in gauging its brand value and sentiment in the market.
• Promoters are highly likely to recommend your product or service, hence bringing in
more business,
• whereas Detractors are likely to recommend against your product or service,
hence bringing the business down.
These insights can help a business make customer-oriented decisions and improve its
product.
Two-thirds of Fortune 500 companies use NPS
Let's first look at the data we have gathered
Dataset: https://drive.google.com/file/d/1c0ClC8SrPwJq5rrkyMKyPn80nyHcFikK/view?usp=sharing
Notice that the file contains the scores from the likelihood to recommend survey
Using NumPy
• we will bin our data into promoters/detractors
• calculate the percentage of promoters/detractors
• calculate NPS
Why use Numpy?
Suppose you are given a list of numbers and you have to find square of each number and store it
in original list.
a = [1,2,3,4,5]
Solution: The basic approach is to iterate over the list and square each element
a = [i**2 for i in a]
print(a)
[1, 4, 9, 16, 25]
Let's try the same operation with NumPy
a = np.array([1,2,3,4,5])
print(a**2)
[ 1 4 9 16 25]
The biggest benefit of NumPy is that it supports element-wise operations
Notice how easy and clean the syntax is.
But is the clean syntax and ease of writing the only benefit we are getting here?
• To understand this, let's time these operations
• We will use %timeit to measure the time for operations
l = range(1000000)
%timeit [i**2 for i in l]
546 ms ± 164 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It took approx. 546 ms per loop to iterate over and square all the numbers from 0 to 999,999
Let's perform the same operation using numpy arrays
• We will use the np.array() function for this.
• np.array() converts a Python sequence (here, a range) into a numpy array.
• We can then perform element-wise operations on it
l = np.array(range(1000000))
%timeit l**2
797 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Notice the per-loop time for the numpy operation: about 797 µs, several hundred times faster than the list comprehension
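%timeit only works inside IPython/Jupyter. Outside a notebook, a similar comparison can be sketched with the standard-library timeit module. The array size and the number of repetitions below are arbitrary choices, and the absolute timings will differ from machine to machine:

```python
import timeit
import numpy as np

n = 100_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)  # explicit 64-bit ints so the squares do not overflow

# Total time for 10 repetitions of each approach
list_time = timeit.timeit(lambda: [i**2 for i in py_list], number=10)
numpy_time = timeit.timeit(lambda: np_arr**2, number=10)

print(f"list comprehension: {list_time:.4f} s for 10 runs")
print(f"numpy array:        {numpy_time:.4f} s for 10 runs")

# Both approaches compute the same values
same_result = [i**2 for i in py_list] == (np_arr**2).tolist()
print(same_result)  # True
```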
What is the major reason behind numpy's faster computation?
• A numpy array is densely packed in memory because all its elements share one (homogeneous) type.
• Numpy processes the whole array inside one compiled loop, which can use vectorized CPU
instructions, instead of interpreting Python bytecode for every element.
• Numpy functions are implemented in C, which again makes them faster than operations on
Python lists.
What is the takeaway from this exercise?
• NumPy provides clean syntax for element-wise operations
• The per-loop time for numpy to perform the operation is much lower than for a list
In fact, Numpy is one of the most important packages for performing numerical computations
Why?
Most computations in DS/ML/DA can be broken down into element-wise operations
Let's create some basic arrays in NumPy
First method we'll see in Numpy is array()
• We pass a Python list into np.array()
• It converts that Python list into a numpy array
# Let's create a 1-D array
arr1 = np.array([1, 2, 3])
print(arr1)
print(arr1 * 2)
[1 2 3]
[2 4 6]
• This is NOT a normal Python list
• It's a numpy array - supports element-wise operation
Question: What will be the dimension of this array?
1, because it is a 1D array.
We can get the number of dimensions of an array using the ndim property
arr1.ndim
1
Numpy arrays have another property called shape, which tells us the number of
elements along every dimension
Let's get the shape of the array.
arr1.shape
(3,)
Let's take another example to understand shape and ndim better
arr2 = np.array([[1, 2, 3], [4, 5, 6], [10, 11, 12]])
print(arr2)
[[ 1 2 3]
[ 4 5 6]
[10 11 12]]
What do you think will be the dimension of this 2D array?
arr2.ndim
2
And what about the shape?
arr2.shape
(3, 3)
Let's create some sequences in Numpy
From a range and stepsize - arange()
• np.arange()
• Similar to range()
• We can pass starting point, ending point (not included in array) and step-size
• arange(start, end, step)
arr2 = np.arange(1, 5)
arr2
array([1, 2, 3, 4])
arr2_stepsize = np.arange(1, 5, 2)
arr2_stepsize
array([1, 3])
• np.arange() behaves in same way as range() function
But then why not call it np.range?
• In np.arange(), we can pass a floating point number as step-size
arr3 = np.arange(1, 5, 0.5)
arr3
array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
Let's check the type of a Numpy array
type(arr1)
numpy.ndarray
But why are we calling it an array? Why not a NumPy list?
How numpy works under the hood?
• It's a Python library: we write code in Python to use numpy
However, numpy itself is written in C
This allows numpy to manage memory very efficiently
But why are C arrays more efficient or faster than Python lists?
• In Python List, we can store objects of different types together - int, float, string,
etc.
• The actual values of objects are stored somewhere else in the memory
• Only References to those objects (R1, R2, R3, ...) are stored in the Python List.
• So, when we have to access an element in Python List, we first access the
reference to that element and then that reference allows us to access the value of
element stored in memory
C array does all this in one step
• C array stores objects of same data type together
• Actual values are stored in same contiguous memory
• So, when we have to access an element in C array, we access it directly using
indices.
BUT, notice that this makes a NumPy array lose the flexibility to store
heterogeneous data
==> Unlike Python lists, a NumPy array can only hold homogeneous data (all elements of one type)
• So numpy arrays are NOT really Python lists
• They are basically C arrays
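The C-style layout described above can be inspected from Python. A minimal sketch, assuming an explicitly requested 64-bit integer dtype (the default integer size can differ between platforms):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5], dtype=np.int64)

print(arr.dtype)                  # int64: one type for every element
print(arr.itemsize)               # 8: bytes used by each element
print(arr.nbytes)                 # 40: 5 elements * 8 bytes, stored back to back
print(arr.strides)                # (8,): step of 8 bytes to reach the next element
print(arr.flags['C_CONTIGUOUS'])  # True: values sit in one contiguous block
```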
Let's further see the C type behaviour of Numpy
• For this, lets pass a floating point number as one of the values in np array
arr4 = np.array([1, 2, 3, 4])
arr4
array([1, 2, 3, 4])
arr4 = np.array([1, 2, 3, 4.0])
arr4
array([1., 2., 3., 4.])
• Notice that the ints are upcast (promoted) to float
• Because a single C array can store values of only one data type, i.e. homogeneous data
• If you press "Shift+tab" inside np.array() function
• You can see function's signature
– name
– input parameters
– default values of input parameters
• Look at dtype=None
– dtype means data-type
– which is set to None by default
What if we set dtype to float?
arr5 = np.array([1, 2, 3, 4])
arr5
array([1, 2, 3, 4])
arr5 = np.array([1, 2, 3, 4], dtype="float")
arr5
array([1., 2., 3., 4.])
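Besides setting dtype at creation time, an existing array can be converted with astype(). A small sketch with made-up values; note that converting floats to int drops the fractional part rather than rounding:

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2, 3.5])    # one float is enough to promote every element

print(mixed.dtype)               # float64

as_float = ints.astype(float)    # new array, converted to float
truncated = mixed.astype(int)    # fractional part is dropped: 3.5 becomes 3

print(as_float)                  # [1. 2. 3.]
print(truncated)                 # [1 2 3]
```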
Conclusion:
• "nd" in ndarray stands for n-dimensional - ndarray means an n-dimensional array
Indexing and Slicing upon Numpy arrays
m1 = np.arange(12)
m1
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Indexing in np arrays
• Works the same as for lists
m1[0] # gives first element of array
m1[12] # IndexError: valid indexes for 12 elements are 0 to 11
Question: What will be the output of m1[-1] ?
m1[-1]
11
Numpy also supports negative indexing.
You can also use a list of indexes in numpy
m1 = np.array([100,200,300,400,500,600])
m1[[2,3,4,1,2,2]]
array([300, 400, 500, 200, 300, 300])
Did you notice how a single index can be repeated multiple times when giving a list of indexes?
Slicing
• Similar to Python lists
• We can slice out and get a part of np array
• Can also mix Indexing and Slicing
m1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1[:5]
array([1, 2, 3, 4, 5])
Question: What'll be the output of m1[-5:-1]?
m1[-5:-1]
array([6, 7, 8, 9])
Question: What'll be the output of m1[-5:-1:-1]?
m1[-5: -1: -1]
array([], dtype=int64)
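The empty result comes from the direction of the step. A short sketch of why, using the same ten values:

```python
import numpy as np

m1 = np.arange(1, 11)          # [ 1  2 ...  10]

# -5 is index 5 and -1 is index 9. Walking backwards from index 5
# can never reach index 9, so nothing is selected.
empty = m1[-5:-1:-1]

# With a negative step, the start has to come after the stop
backwards = m1[-1:-5:-1]       # indexes 9, 8, 7, 6

# Omitting start and stop reverses the whole array
reversed_all = m1[::-1]

print(empty)         # []
print(backwards)     # [10  9  8  7]
print(reversed_all)  # [10  9  8  7  6  5  4  3  2  1]
```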
Fancy indexing (Masking)
• Numpy arrays can be indexed with boolean arrays (masks).
• This is called boolean (mask) indexing; together with indexing by a list of indexes, it is
often referred to as fancy indexing.
What would happen if we do this?
m1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1 < 6
array([ True, True, True, True, True, False, False, False, False,
False])
• The comparison also happens on each element
• Values less than 6 give True; 6 and all larger values give False
Now, Let's use this to filter or mask values from our array
• Condition will be passed instead of indices and slice ranges
m1[m1 < 6]
array([1, 2, 3, 4, 5])
Notice that,
• Value corresponding to True is retained
• Value corresponding to False is filtered out
This is similar to filtering using filter() function
filter(lambda x: x < 6, [...]) Refer Notes
How can we filter/mask even values from our array?
m1[m1%2 == 0]
array([ 2, 4, 6, 8, 10])
m1[m1%2==0].shape
(5,)
Question: Multiple conditions in numpy
Given an array of elements from 0 to 10, filter the elements which are
multiple of 2 or 5.
a = [0,1,2,3,4,5,6,7,8,9,10]
output should be [0,2,4,5,6,8,10]
a = np.arange(11)
a[(a %2 == 0) | (a%5 == 0)]
array([ 0, 2, 4, 5, 6, 8, 10])
(Optional) Why do we use & , | instead of the and, or keywords for writing multiple
conditions?
The difference is that
• and and or gauge the truth of the whole object, whereas
• & and | are bitwise operators and perform the operation on each bit
Recall that everything is treated as object in python.
So, when we use and or or,
• Python will treat object as single Boolean entity.
bool(42)
True
bool(0)
False
bool(42 or 0)
True
bool(42 and 0)
False
Now, when we apply & and |, Python does a bitwise AND and OR instead of working on the whole object.
bin(42)
'0b101010'
bin(50)
'0b110010'
bin(42 & 50)
'0b100010'
bin(42 | 50)
'0b111010'
Notice that the bits of objects are being compared to get the result.
In similar fashion, you can think of numpy array with boolean values as string of bits
• where 1 = True
• and 0 = False
import numpy as np
arr = np.array([1, 0, 1, 0, 1, 0], dtype = bool)
arr1 = np.array([1, 1, 0, 0, 1, 0], dtype =bool)
arr
array([ True, False, True, False, True, False])
arr1
array([ True, True, False, False, True, False])
arr | arr1
array([ True, True, True, False, True, False])
Using and or or on arrays will try to evaluate the truth of the entire array, which is not defined
(as numpy is made for element-wise operations)
arr and arr1
----------------------------------------------------------------------
-----
ValueError Traceback (most recent call
last)
<ipython-input-53-2ae36cd9a0b9> in <module>
----> 1 arr and arr1
ValueError: The truth value of an array with more than one element is
ambiguous. Use a.any() or a.all()
(Optional) Now, What is the dtype of mask?
It is a boolean array. Hence, it can be treated as string of bits and hence, we use & and | operator
on it
(Optional) But why do we use () when using multiple conditions?
Remember that the precedence of & and | is higher than that of >, <, ==.
Let's take an example:
a %2 == 0 | a%5 == 0
In the above mask, Python will evaluate 0 | a%5 first. What remains is a chained comparison
over whole arrays, which throws an error.
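The effect of the parentheses can be checked directly. In this sketch the unparenthesized expression is wrapped in try/except so the script keeps running; np.logical_or is shown as an equivalent function form of |:

```python
import numpy as np

a = np.arange(11)

# With parentheses, each comparison produces a boolean array first
with_parens = a[(a % 2 == 0) | (a % 5 == 0)]

# Without parentheses this is parsed as: a % 2 == (0 | a % 5) == 0,
# a chained comparison that needs the truth value of a whole array
try:
    a % 2 == 0 | a % 5 == 0
    raised = False
except ValueError:
    raised = True

# Function form of the same element-wise OR
with_function = a[np.logical_or(a % 2 == 0, a % 5 == 0)]

print(with_parens)    # [ 0  2  4  5  6  8 10]
print(raised)         # True
```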
Operations on Numpy Arrays
We have already seen operations of a Numpy array and a scalar (single value)
arr = np.arange(4)
arr
array([0, 1, 2, 3])
arr + 3
array([3, 4, 5, 6])
Let's see some algebraic operations on two arrays
# Corresponding elements of arrays get added
a = np.array([1, 2, 3])
b = np.array([2, 2, 2])
a + b
array([3, 4, 5])
# Corresponding elements of arrays get multiplied
a * b
array([2, 4, 6])
Question: What will be the output of the following ?
a = np.array([0,2,3])
b = np.array([1,3,5])
a*b
array([ 0, 6, 15])
Numpy will do element wise multiplication
Aggregate / Universal Functions on 1D array (ufunc)
Numpy provides various universal functions that cover a wide variety of operations.
For example:
• When addition of constant to array is performed element-wise using + operator, then
np.add() is called internally.
import numpy as np
a = np.array([1,2,3,4])
a+2 # ufunc `np.add()` called automatically
array([3, 4, 5, 6])
np.add(a,2)
array([3, 4, 5, 6])
• These functions operate on ndarray (N-dimensional array) i.e Numpy’s array
class.
• They perform fast element-wise array operations.
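The link between operators and ufuncs can be verified with a few comparisons. A minimal sketch with two small example arrays:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Each arithmetic operator has a matching ufunc
print(np.array_equal(a + b, np.add(a, b)))        # True
print(np.array_equal(a * b, np.multiply(a, b)))   # True
print(np.array_equal(a ** 2, np.power(a, 2)))     # True

# The functions themselves are instances of numpy's ufunc class
print(isinstance(np.add, np.ufunc))               # True
```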
Aggregate Functions/ Reduction functions
Now, how would we calculate the sum of elements of an array?
np.sum()
• It sums all the values in np array
a = np.arange(1, 11)
a
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
np.sum(a) # sums all the values present in array
55
Now, What if we want to find the average value or median value of all the elements
in an array?
np.mean()
• np.mean() gives mean of all values in np array
np.mean(a)
5.5
Now, we want to find the minimum value in the array
np.min() function can help us with this
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
np.min(a) # minimum value
1
We can also find max elements in an array.
np.max() function will give us maximum value in the array
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
np.max(a) # maximum value
10
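The median mentioned earlier has its own function, np.median(). A small sketch with made-up values, chosen so that one large value pulls the mean away from the median:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 100])

print(np.mean(values))    # 22.0 (pulled up by the value 100)
print(np.median(values))  # 3.0  (middle value of the sorted data)
```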
Usecase: NPS
Let's first download the dataset
import numpy as np
!gdown 1c0ClC8SrPwJq5rrkyMKyPn80nyHcFikK
Downloading...
From: https://drive.google.com/uc?id=1c0ClC8SrPwJq5rrkyMKyPn80nyHcFikK
To: /content/survey.txt
100% 2.55k/2.55k [00:00<00:00, 4.80MB/s]
Let's load the data we saw earlier. For this we will use the np.loadtxt() function
Documentation: https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html
score = np.loadtxt('survey.txt', dtype ='int')
We provide the file name along with the dtype of the data we want to load
Let's see what the data looks like
score[:5]
array([ 7, 10, 5, 9, 9])
Let's check the number of responses
score.shape
(1167,)
There are a total of 1167 responses for the LTR survey
Let's perform some sanity checks on the data
Let's check the minimum and maximum values in the array
score.min()
score.max()
10
Looks like there are no records with a score of 0.
Now, let's calculate NPS using these responses.
NPS = % Promoters - % Detractors
Now, in order to calculate NPS, we need to calculate two things:
• % Promoters
• % Detractors
In order to calculate % Promoters and % Detractors, we need to get the count of promoters as
well as detractors.
Question: How can we get the count of Promoters/Detractors?
We can do so by using fancy indexing (masking)
Let's get the count of promoters and detractors
Detractors have a score <=6
detractors = score[score <= 6].shape[0]
total = score.shape[0]
percent_detractors = detractors/total*100
percent_detractors
28.449014567266495
Similarly, Promoters have a score 9-10
promoters = score[score >= 9].shape[0]
percent_promoters = promoters/total*100
percent_promoters
52.185089974293064
Calculating NPS
To calculate NPS, we need to compute
% promoters - % detractors
nps = percent_promoters - percent_detractors
nps
23.73607540702657
np.round(nps)
24.0
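The same steps can be wrapped in a small helper. This is a sketch, not the notebook's own code: the function name nps_from_scores and the sample scores are hypothetical, and np.count_nonzero() is used here to count the True values in each mask instead of taking the shape of the filtered array.

```python
import numpy as np

def nps_from_scores(scores):
    """Return NPS (% promoters - % detractors) for an array of 0-10 scores."""
    scores = np.asarray(scores)
    promoters = np.count_nonzero(scores >= 9)    # scores 9-10
    detractors = np.count_nonzero(scores <= 6)   # scores 0-6
    return 100 * (promoters - detractors) / scores.size

sample = np.array([10, 9, 8, 7, 6, 5, 10, 9, 3, 10])  # made-up responses
print(nps_from_scores(sample))  # 20.0
```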
Now, there are two types of data:
• Numerical (we have seen so far)
• Categorical (in form of categories)
For example:
• An array of Blood pressure status for a patient on various days.
• It can have values 'Low', 'Good', 'High'
Similarly,
• An array of workout data which contains muscle area impacted from muscle training
• Values can be core, legs, shoulder, back
Similarly, We will map our scores into 3 categories s.t:
• 0 - 6: Detractors
• 7 - 8: Passive
• 9 - 10: Promoters
This process is called binning
But, why binning?
Binning helps us reduce the number of unique values.
• simplifying the data without any significant loss of info.
• helps in quick absorption of information
• also helps in visualization (will be discussed later)
• also helps in simplifying inputs to ML models (hence, reducing computational complexity)
How'll we bin our data ?
Will this work ?
score[score <= 6] = 'Detractors'
----------------------------------------------------------------------
-----
ValueError Traceback (most recent call
last)
<ipython-input-86-b1629d66b9e5> in <module>
----> 1 score[score <= 6] = 'Detractors'
ValueError: invalid literal for int() with base 10: 'Detractors'
Why didn't the above code work?
Recall that arrays have a homogeneous datatype
• dtype of our array is int
We are trying to assign a string to an int array; hence, it throws an error
So, what do we do?
What if we create an array of same length as score array and assign values to new array based
on values present in score array.
How do we initialize new array based on length of preexisting array ?
Numpy provides us with a method to initialize empty array : np.empty()
It takes the following arguments:
• shape
• dtype
Question: What will be the shape and dtype of new array ?
arr = np.empty(shape = score.shape, dtype = 'str')
arr
array(['', '', '', ..., '', '', ''], dtype='<U1')
Notice the following
• All the elements of the array are empty string
• But, the dtype is being shown as U1.
Didn't we initialize the dtype as string?
Why is the dtype being shown as <U1 ?
U1 means Unicode string of length 1.
Whenever we initialize the array with str datatype, it automatically initializes it of type Unicode
string with length 1.
Question: What will happen in following case? Will the string be assigned to the 0th
index ?
arr[0] = 'hellow'
arr
array(['h', '', '', ..., '', '', ''], dtype='<U1')
Notice that,
• as the length is defined as 1
• it automatically truncates the rest of string and only stores the first character.
But, we want to store whole string 'Detractor/Promoter/Passive'.
How do we change the cap on length of string ?
We can specify the length of string while initializing the array.
arr = np.empty(shape = score.shape, dtype = 'U10')
arr
array(['', '', '', ..., '', '', ''], dtype='<U10')
arr.shape
(1167,)
Instead of specifying the dtype as str, we initialize it as Un where n is the number of characters
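Rather than hard-coding the width, it can be derived from the longest label. A minimal sketch, assuming the three category names used in this use case; the 'U3' array is only there to show the truncation again:

```python
import numpy as np

labels = ['detractors', 'passive', 'promoters']
width = max(len(label) for label in labels)   # 10, the length of 'detractors'

arr = np.empty(shape=3, dtype=f'U{width}')
arr[0] = 'detractors'
arr[1] = 'passive'
arr[2] = 'promoters'

too_short = np.empty(shape=1, dtype='U3')
too_short[0] = 'promoters'                    # silently cut to 3 characters

print(arr)           # ['detractors' 'passive' 'promoters']
print(too_short[0])  # pro
```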
Now, we have got a string array. Let's bin our score values
arr[score <= 6] = 'detractors'
arr
array(['', '', 'detractors', ..., 'detractors', '', ''], dtype='<U10')
Similarly, we can do it for passive and promoters
arr[(score >= 7) & (score <= 8)] = 'passive'
arr[score >= 9] = 'promoters'
arr
array(['passive', 'promoters', 'detractors', ..., 'detractors',
'promoters', 'promoters'], dtype='<U10')
arr[:15]
array(['passive', 'promoters', 'detractors', 'promoters', 'promoters',
'detractors', 'passive', 'promoters', 'promoters', 'promoters',
'promoters', 'detractors', 'promoters', 'promoters',
'passive'],
dtype='<U10')
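As an alternative to creating an empty array and filling it mask by mask, the binning can be sketched in one call with np.select(). This is a different technique from the one used above; the short score array and the 'unknown' default label are made up for illustration:

```python
import numpy as np

score = np.array([7, 10, 5, 9, 0, 8])   # made-up responses

conditions = [
    score <= 6,
    (score >= 7) & (score <= 8),
    score >= 9,
]
choices = ['detractors', 'passive', 'promoters']

# For each element, np.select picks the choice of the first condition that is True
binned = np.select(conditions, choices, default='unknown')
print(binned)
```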
Now, we have array with desired values.
How do we count the number of instances of each value?
There are two ways of doing it.
Let's look at long way first.
We do fancy indexing for each unique value and get the shape
detractors_count = arr[arr == 'detractors'].shape[0]
detractors_count
332
passive_count = arr[arr == 'passive'].shape[0]
passive_count
226
promoters_count = arr[arr == 'promoters'].shape[0]
promoters_count
609
Now, there's a short way as well.
Numpy provides us a function .unique() to get unique element
np.unique(arr)
array(['detractors', 'passive', 'promoters'], dtype='<U10')
But we want the count of each unique element.
For this, we can pass argument return_counts = True
np.unique(arr, return_counts = True)
(array(['detractors', 'passive', 'promoters'], dtype='<U10'),
array([332, 226, 609]))
unique, counts = np.unique(arr, return_counts = True)
unique
array(['detractors', 'passive', 'promoters'], dtype='<U10')
counts
array([332, 226, 609])
Now, let's calculate the percent of promoters and detractors
% Promoters
percent_promoters = counts[2]/counts.sum()*100
% Detractors
percent_detractors = counts[0]/counts.sum()*100
Calculating NPS
To calculate NPS, we need to compute
% promoters - % detractors
nps = percent_promoters - percent_detractors
nps
23.73607540702657
np.round(nps)
24.0
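Indexing counts by position (counts[2], counts[0]) relies on remembering the sorted order of the labels. One way to avoid that is to pair labels with counts in a dictionary. In this sketch the label array is rebuilt with np.repeat() from the counts shown above, as a stand-in for the real arr:

```python
import numpy as np

# Stand-in for the binned array, built from the counts reported above
arr = np.repeat(['detractors', 'passive', 'promoters'], [332, 226, 609])

labels, counts = np.unique(arr, return_counts=True)
count_by_label = dict(zip(labels.tolist(), counts.tolist()))

total = sum(count_by_label.values())
nps = 100 * (count_by_label['promoters'] - count_by_label['detractors']) / total

print(count_by_label)   # {'detractors': 332, 'passive': 226, 'promoters': 609}
print(round(nps, 2))    # 23.74
```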
(Optional) What is a good NPS score ?
Source: https://chattermill.com/blog/what-is-a-good-nps-score/
(Optional) Industry wise NPS benchmark
Use Case: Fitbit
Imagine you are a Data Scientist at Fitbit
You've been given user data to analyse and find some insights which can be shown on the
smart watch.
But why would we want to analyse the user data for designing the watch?
These insights from the user data can help the business make customer-oriented decisions for the
product design.
Let's first look at the data we have gathered
Link: https://drive.google.com/file/d/1Uxwd4H-tfM64giRS1VExMpQXKtBBtuP0/view?usp=sharing
Notice that there are some user features in the data
These are provided as various columns in the data.
Every row is called a record or data point
What are all the features provided to us?
• Date
• Step Count
• Mood (Categorical)
• Calories Burned
• Hours of sleep
• Feeling Active (Categorical)
Using NumPy, we will explore this data to look for some interesting insights - Exploratory
Data Analysis.
EDA is all about asking the right questions
What kind of questions can we answer using this data?
• How many records and features are there in the dataset?
• What is the average step count?
• On which day was the step count highest/lowest?
Can we find some deeper insights?
We can probably see how daily activity affects sleep and mood.
We will try finding
• How does daily activity affect mood?
import numpy as np
Working with 2-D arrays (Matrices)
Question : How do we create a matrix using numpy?
m1 = np.array([[1,2,3],[4,5,6]])
m1
# Nicely printing out in a Matrix form
array([[1, 2, 3],
[4, 5, 6]])
How can we check shape of a numpy array?
m1.shape # m1 has 2 rows and 3 columns
(2, 3)
Question: What is the type of the result of m1.shape? Which data structure is this?
Tuple
Now, What is the dimension of this array?
m1.ndim
2
Question
a = np.array([[1,2,3],
[4,5,6],
[7,8,9]])
b = len(a)
What'll be the value of b?
Ans: 3
Explanation: len() of an nD array gives the size of its first dimension
a = np.array([[1,2,3],
[4,5,6],
[7,8,9]])
a
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
len(a)
3
What will be the shape of array a?
a.shape
(3, 3)
• So, it is a 2-D array with 3 rows and 3 columns
Clearly, np.arange() on its own only produces 1D arrays, so we cannot create
high-dimensional arrays with it directly
How can we create high dimensional arrays?
• Using reshape()
For a 2D array
• First argument is no. of rows
• Second argument is no. of columns
m2 = np.arange(1, 13)
m2
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
• We can pass the desired dimensions of array in reshape()
In what ways can we convert this array with 12 values into high-dimensional array?
Can we make m2 a 4 × 4 array?
• Obviously NO
• 4 × 4 requires 16 values, but we only have 12 in m2
m2 = np.arange(1, 13)
m2.reshape(4, 4)
----------------------------------------------------------------------
-----
ValueError Traceback (most recent call
last)
<ipython-input-122-fc70b006b379> in <module>
1 m2 = np.arange(1, 13)
----> 2 m2.reshape(4, 4)
ValueError: cannot reshape array of size 12 into shape (4,4)
So, What are the ways in which we can reshape it?
• 4 ×3
• 3×4
• 6 ×2
• 2 ×6
• 1 ×12
• 12 ×1
m2 = np.arange(1, 13)
m2.reshape(4, 3)
array([[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9],
[10, 11, 12]])
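reshape() also accepts -1 for at most one dimension, in which case numpy works out that dimension from the total number of elements. A short sketch with the same 12 values:

```python
import numpy as np

m2 = np.arange(1, 13)           # 12 values

auto_cols = m2.reshape(3, -1)   # 3 rows, columns inferred: 12 / 3 = 4
auto_rows = m2.reshape(-1, 6)   # 6 columns, rows inferred: 12 / 6 = 2
column = m2.reshape(-1, 1)      # a single column with 12 rows

print(auto_cols.shape)  # (3, 4)
print(auto_rows.shape)  # (2, 6)
print(column.shape)     # (12, 1)

# The inferred size must still be a whole number
try:
    m2.reshape(5, -1)           # 12 is not divisible by 5
    failed = False
except ValueError:
    failed = True
print(failed)           # True
```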
m2 = np.arange(1, 13)
m2
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
m2.shape
(12,)
Let's do some reshaping here
m2.reshape(12, 1)
array([[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 9],
[10],
[11],
[12]])
Now, What's the difference between (12,) and (12, 1)?
• (12,) means it's a 1D array
• (12, 1) means it's a 2D array
Question
What will be output for the following code?
a = np.array([[1,2,3],[0,1,4]])
print(a.ndim)
Ans: 2
a = np.array([[1,2,3],[0,1,4]])
print(a.ndim)
Since it is a 2 dimensional array, the number of dimension will be 2.
Transpose
• Change rows into columns and columns into rows
• Just use <Matrix>.T
a = np.arange(3)
a
array([0, 1, 2])
a.T
array([0, 1, 2])
Why did the transpose not change anything?
• Because numpy sees a as a vector of shape (3,), NOT a matrix
• We'll have to reshape the vector a to make it a matrix
a = np.arange(3).reshape(1, 3)
a
# Now a has dimensions (1, 3) instead of just (3,)
# It has 1 row and 3 columns
array([[0, 1, 2]])
a.T
# It has 3 rows and 1 column
array([[0],
[1],
[2]])
Conclusion
• Transposing a 1D array returns it unchanged; the array needs at least 2 dimensions for the
transpose to have an effect
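A quick check of how shapes and element positions change under transpose, using a small 2 × 3 example array:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # 2 rows, 3 columns
v = np.arange(3)                 # 1D vector

print(A.shape, A.T.shape)        # (2, 3) (3, 2)
print(A[1, 2] == A.T[2, 1])      # True: element (i, j) moves to (j, i)
print(v.shape, v.T.shape)        # (3,) (3,): nothing to swap in 1D
```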
Flattening of an array
What if we want to convert this 2D or nD array back to 1D array?
There is a function named flatten() to help you do so.
A = np.arange(12).reshape(3, 4)
A
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
A.flatten()
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
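One detail worth knowing: flatten() always returns a copy, so changing the result leaves the original matrix untouched. The related ravel() method, shown here only for contrast, returns a view where possible. A minimal sketch:

```python
import numpy as np

A = np.arange(12).reshape(3, 4)

flat_copy = A.flatten()
flat_copy[0] = 99                 # modifies the copy only
print(A[0, 0])                    # 0

flat_view = A.ravel()             # a view for a contiguous array like A
print(np.shares_memory(A, flat_copy))   # False
print(np.shares_memory(A, flat_view))   # True
```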
Indexing and Slicing on 2D Numpy arrays
Indexing in np arrays
• Works same as lists
m1 = np.arange(1,10).reshape((3,3))
m1
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
m1[1][2]
6
OR
• We can pass both indexes inside one pair of brackets, separated by a comma: m1[1, 2]
What will be the output of this?
m1[1, 1] #m1[row, column]
5
We saw how we can use list of indexes in numpy array
m1 = np.array([100,200,300,400,500,600])
m1[[2,3,4,1,2,2]]
array([300, 400, 500, 200, 300, 300])
How'll list of indexes work in 2D array ?
m1 = np.arange(9).reshape((3,3))
m1
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
m1[[0,1,2],[0,1,2]] # picking up element (0,0), (1,1) and (2,2)
array([0, 4, 8])
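With two index lists, numpy pairs them up position by position; with a single list, it selects whole rows. A short sketch on the same 3 × 3 array:

```python
import numpy as np

m1 = np.arange(9).reshape((3, 3))

diagonal = m1[[0, 1, 2], [0, 1, 2]]   # elements (0,0), (1,1), (2,2)
picked = m1[[0, 2], [1, 1]]           # elements (0,1) and (2,1)
rows = m1[[0, 2]]                     # one list only: rows 0 and 2

print(diagonal)  # [0 4 8]
print(picked)    # [1 7]
print(rows)
# [[0 1 2]
#  [6 7 8]]
```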
Slicing
• Need to provide two slice ranges - one for row and one for column
• Can also mix Indexing and Slicing
m1 = np.arange(12).reshape(3,4)
m1
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
m1[:2] # gives first two rows
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
How can we get columns from 2D array?
m1[:, :2] # gives first two columns
array([[0, 1],
[4, 5],
[8, 9]])
Question: Given a 2-D array
m1 = np.array([[0,1,2,3],
[4,5,6,7],
[8,9,10,11]])
Question for you: Can you just get this much of our array m1?
[[5, 6],
[9, 10]]
Remember our m1 is:
m1 = [[0, 1, 2, 3],
[4, 5, 6, 7],
[8, 9, 10, 11]]
# First get rows 1 to all
# Then get columns 1 to 3 (not included)
m1[1:, 1:3]
array([[ 5, 6],
[ 9, 10]])
Question: What if I need the columns at index 1 and 3?
[[1, 3],
[5, 7],
[9,11]]
# Get all rows
# Then get columns from 1 to all with step of 2
m1[:, 1::2]
array([[ 1, 3],
[ 5, 7],
[ 9, 11]])
• We can also pass indices of required columns as a Tuple to get the same result
# Get all rows
# Then get columns 1 and 3
m1[:, (1,3)]
array([[ 1, 3],
[ 5, 7],
[ 9, 11]])
Fancy indexing (Masking)
What would happen if we do this?
m1 = np.arange(12).reshape(3, 4)
m1 < 6
array([[ True, True, True, True],
[ True, True, False, False],
[False, False, False, False]])
• A matrix having boolean values True and False is returned
• We can use this boolean matrix to filter our array
Now, Let's use this to filter or mask values from our array
• Condition will be passed instead of indices and slice ranges
m1[m1 < 6]
# Value corresponding to True is retained
# Value corresponding to False is filtered out
array([0, 1, 2, 3, 4, 5])
How can we filter/mask even values from our array?
m1[m1%2 == 0]
array([ 0, 2, 4, 6, 8, 10])
But did you notice that matrix gets converted into a 1D array after masking?
m1
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
m1[m1%2 == 0]
array([ 0, 2, 4, 6, 8, 10])
It happens because
• To retain the matrix shape, it would have to retain all the elements
• It cannot keep its 3 × 4 shape with fewer elements
• So, this filtering operation implicitly converts the high-dimensional array into a 1D array
If we want, we can reshape the resulting 1D array into 2D
• But, we need to know beforehand what is the dimension or number of elements in
resulting 1D array
m1[m1%2==0].shape
(6,)
m1[m1%2==0].reshape(2, 3)
array([[ 0, 2, 4],
[ 6, 8, 10]])
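If the goal is to keep the original 3 × 4 shape, one option is np.where(), which replaces the values that fail the condition instead of dropping them. This is a different technique from masking, and the filler value -1 is an arbitrary choice for illustration:

```python
import numpy as np

m1 = np.arange(12).reshape(3, 4)

# Keep even values, put -1 wherever the condition is False
kept = np.where(m1 % 2 == 0, m1, -1)

print(kept)
# [[ 0 -1  2 -1]
#  [ 4 -1  6 -1]
#  [ 8 -1 10 -1]]
print(kept.shape)   # (3, 4)
```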
Universal Functions (ufunc) on 2D & Axis
Aggregate Functions/ Reduction functions
We saw earlier how aggregate functions work on a 1D array
arr = np.arange(3)
arr
array([0, 1, 2])
arr.sum()
3
Let's apply Aggregate functions on 2D array
np.sum()
a = np.arange(12).reshape(3, 4)
a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
np.sum(a) # sums all the values present in array
66
What if we want to sum the elements row-wise or column-wise?
• By setting axis parameter
What will np.sum(a, axis=0) do?
• np.sum(a, axis=0) adds together values in DIFFERENT rows
• axis = 0 ---> Changes will happen along the vertical axis
• Summing of values happen in the vertical direction
• Rows collapse/merge when we do axis=0
np.sum(a, axis=0)
array([12, 15, 18, 21])
Now, What if we specify axis=1?
• np.sum(a, axis=1) adds together values in DIFFERENT columns
• axis = 1 ---> Changes will happen along the horizontal axis
• Summing of values happen in the horizontal direction
• Columns collapse/merge when we do axis=1
np.sum(a, axis=1)
array([ 6, 22, 38])
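A handy way to remember the axis argument is that the axis you name is the one that disappears from the shape. A small sketch; keepdims=True is an optional argument that keeps the collapsed axis with length 1:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)                 # shape (3, 4)

col_sums = a.sum(axis=0)                        # axis 0 (length 3) collapses
row_sums = a.sum(axis=1)                        # axis 1 (length 4) collapses
row_sums_2d = a.sum(axis=1, keepdims=True)      # collapsed axis kept as length 1

print(col_sums.shape)     # (4,)
print(row_sums.shape)     # (3,)
print(row_sums_2d.shape)  # (3, 1)
```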
Now, What if we want to find the average value or median value of all the elements
in an array?
np.mean(a) # no need to give any axis
5.5
What if we want to find the mean of elements in each row or in each column?
• We can do same thing with axis parameter like we did for np.sum() function
Question: Now you tell What will np.mean(a, axis=0) give?
• It will give mean of values in DIFFERENT rows
• axis = 0 ---> Changes will happen along the vertical axis
• Mean of values will be calculated in the vertical direction
• Rows collapse/merge when we do axis=0
np.mean(a, axis=0)
array([4., 5., 6., 7.])
How can we get the mean of elements in each row?
• np.mean(a, axis=1) will give mean of values in DIFFERENT columns
• axis = 1 ---> Changes will happen along the horizontal axis
• Mean of values will be calculated in the horizontal direction
• Columns collapse/merge when we do axis=1
np.mean(a, axis=1)
array([1.5, 5.5, 9.5])
Now, we want to find the minimum value in the array
np.min() function can help us with this
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
np.min(a)
0
What if we want to find row wise minimum value?
Use axis argument!!
np.min(a, axis = 1 )
array([0, 4, 8])
We can also find max elements in an array.
np.max() function will give us maximum value in the array
We can also use axis argument to find row wise/ column wise max.
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
np.max(a) # maximum value
11
np.max(a, axis = 0) # column wise max
array([ 8, 9, 10, 11])
Logical Operations
Now, What if we want to check whether "any" element of array follows a specific
condition?
Let's say we have 2 arrays:
a = np.array([1,2,3,4])
b = np.array([4,3,2,1])
a, b
(array([1, 2, 3, 4]), array([4, 3, 2, 1]))
Let's say we want to find out whether any element in array a is smaller than the
corresponding element in array b
np.any() comes in handy here
• np.any() returns True if at least one element of the boolean array passed to it is True,
that is, if the condition holds for at least one pair of corresponding elements.
a = np.array([1,2,3,4])
b = np.array([4,3,2,1])
np.any(a<b) # Atleast 1 element in a < corresponding element in b
True
Let's try the same condition with different arrays:
a = np.array([4,5,6,7])
b = np.array([4,3,2,1])
np.any(a<b) # All elements in a >= corresponding elements in b
False
• In this case, NONE of the elements in a were smaller than their corresponding
elements in b
• So, np.any(a<b) returned False
What if we want to check whether "all" the elements in our array are non-zero or
follow a specified condition?
np.all()
Let's say we want to find out if every element in array a is smaller than the
corresponding element in array b
Again, Let's say we have 2 arrays:
a = np.array([1,2,3,4])
b = np.array([4,3,2,1])
a, b
(array([1, 2, 3, 4]), array([4, 3, 2, 1]))
np.all(a<b) # Not all elements in a < corresponding elements in b
False
Let's try it with different arrays
a = np.array([1,0,0,0])
b = np.array([4,3,2,1])
np.all(a<b) # All elements in a < corresponding elements in b
True
• In this case, ALL the elements in a were smaller than their corresponding
elements in b
• So, np.all(a<b) returned True
Multiple conditions for .all() function
a = np.array([1, 2, 3, 2])
b = np.array([2, 2, 3, 2])
c = np.array([6, 4, 4, 5])
((a <= b) & (b <= c)).all()
True
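As a small sketch of how the combined condition works, the snippet below builds the boolean mask step by step, using the same three arrays. np.count_nonzero is an extra that the notes do not use; it is shown only as one way to count how many elements satisfy a condition.

```python
import numpy as np

a = np.array([1, 2, 3, 2])
b = np.array([2, 2, 3, 2])
c = np.array([6, 4, 4, 5])

# & combines two boolean arrays element by element.
# Python's `and` would raise a ValueError on arrays with more than one element.
mask = (a <= b) & (b <= c)
print(mask.tolist())     # [True, True, True, True]
print(mask.all())        # True

strictly_less = a < b    # only the first pair satisfies a < b
print(np.any(strictly_less), np.all(strictly_less))   # True False
n_true = np.count_nonzero(strictly_less)
print(n_true)            # 1
```

The parentheses around each comparison matter, because & binds more tightly than <= in Python.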
What if we want to update an array based on condition ?
Suppose you are given an array of integers and you want to update it based on following
condition:
• if element is > 0, change it to +1
• if element < 0, change it to -1.
How will you do it ?
arr = np.array([-3,4,27,34,-2, 0, -45,-11,4, 0 ])
arr
array([ -3, 4, 27, 34, -2, 0, -45, -11, 4, 0])
You can use masking to update the array (as discussed in last class)
arr[arr > 0] = 1
arr [arr < 0] = -1
arr
array([-1, 1, 1, 1, -1, 0, -1, -1, 1, 0])
NumPy also provides a function for this kind of conditional selection.
np.where()
Function signature: np.where(condition, [x, y])
This function returns a new ndarray whose elements are chosen from x where condition is
True and from y where it is False.
arr = np.array([-3,4,27,34,-2, 0, -45,-11,4, 0 ])
np.where(arr > 0, +1, -1)
array([-1, 1, 1, 1, -1, -1, -1, -1, 1, -1])
arr
array([ -3, 4, 27, 34, -2, 0, -45, -11, 4, 0])
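Notice that it didn't change the original array.
Also notice that the zeros became -1 here: every element that fails arr > 0 gets the
second value, whereas the masking approach above left the zeros untouched.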
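If zeros should stay 0, as in the masking version, one possible sketch is to nest a second np.where call. np.sign is a different NumPy function that happens to give the same result for this particular rule; it is mentioned here as an alternative, not as something the notes rely on.

```python
import numpy as np

arr = np.array([-3, 4, 27, 34, -2, 0, -45, -11, 4, 0])

# positives -> 1, negatives -> -1, everything else (the zeros) -> 0
signs = np.where(arr > 0, 1, np.where(arr < 0, -1, 0))
print(signs.tolist())    # [-1, 1, 1, 1, -1, 0, -1, -1, 1, 0]

# np.sign returns -1, 0 or 1 depending on the sign of each element
same = np.array_equal(signs, np.sign(arr))
print(same)              # True
```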
Sorting Arrays
• We can also sort the elements of an array along a given specified axis
• Default axis is the last axis of the array.
np.sort()
a = np.array([2,30,41,7,17,52])
a
array([ 2, 30, 41, 7, 17, 52])
np.sort(a)
array([ 2, 7, 17, 30, 41, 52])
a # the original array is unchanged
array([ 2, 30, 41, 7, 17, 52])
Let's work with 2D array
a = np.arange(9,0,-1).reshape(3,3)
a
array([[9, 8, 7],
[6, 5, 4],
[3, 2, 1]])
Question: What will be the result when we sort using axis = 0 ?
np.sort(a, axis = 0)
array([[3, 2, 1],
[6, 5, 4],
[9, 8, 7]])
Recall that when axis = 0
• change happens along the vertical axis.
Hence, the values within each column are sorted from top to bottom.
a
array([[9, 8, 7],
[6, 5, 4],
[3, 2, 1]])
• Original array is still the same. It hasn't changed
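To compare the two axes side by side, here is a minimal sketch on the same 3 × 3 array. The in-place ndarray.sort() call at the end is an addition to the notes, shown on a copy so the original stays intact.

```python
import numpy as np

a = np.arange(9, 0, -1).reshape(3, 3)

by_col = np.sort(a, axis=0)   # each column sorted top to bottom
by_row = np.sort(a, axis=1)   # each row sorted left to right
default = np.sort(a)          # default is the last axis, here axis=1

print(by_col.tolist())    # [[3, 2, 1], [6, 5, 4], [9, 8, 7]]
print(by_row.tolist())    # [[7, 8, 9], [4, 5, 6], [1, 2, 3]]

# np.sort returns a sorted copy; the array method sorts in place
b = a.copy()
b.sort(axis=1)
print(np.array_equal(b, by_row))   # True
```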
np.argsort()
• Returns the indices that would sort an array.
• Performs an indirect sort along the given axis.
• It returns an array of indices of the same shape as a that index data along the
given axis in sorted order.
a = np.array([2,30,41,7,17,52])
a
array([ 2, 30, 41, 7, 17, 52])
np.argsort(a)
array([0, 3, 4, 1, 2, 5])
As you can see:
• Each entry is the index, in the original array, of the element that belongs at that
position in sorted order. For example, the second entry is 3 because a[3] = 7 is the
second smallest element.
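A short sketch of what the indices from np.argsort() are good for. The labels array is hypothetical and exists only to show that one array can be reordered by the sort order of another.

```python
import numpy as np

a = np.array([2, 30, 41, 7, 17, 52])
order = np.argsort(a)
print(order.tolist())          # [0, 3, 4, 1, 2, 5]

# Indexing with the result gives the sorted array
print(a[order].tolist())       # [2, 7, 17, 30, 41, 52]

# Reversing the indices gives descending order
print(a[order[::-1]].tolist()) # [52, 41, 30, 17, 7, 2]

# Hypothetical parallel array, reordered to follow the sorted values of a
labels = np.array(['p', 'q', 'r', 's', 't', 'u'])
sorted_labels = labels[order]
print(sorted_labels.tolist())  # ['p', 's', 't', 'q', 'r', 'u']
```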
Use Case: Fitness data analysis
Let's first download the dataset
!gdown 1vk1Pu0djiYcrdc85yUXZ_Rqq2oZNcohd
Downloading...
From: https://drive.google.com/uc?id=1vk1Pu0djiYcrdc85yUXZ_Rqq2oZNcohd
To: /content/fit.txt
Let's load the data we saw earlier. For this we will use the np.loadtxt()
function
data = np.loadtxt('fit.txt', dtype='str')
We provide the file name along with the dtype we want the data loaded as
data[:5]
array([['06-10-2017', '5464', 'Neutral', '181', '5', 'Inactive'],
['07-10-2017', '6041', 'Sad', '197', '8', 'Inactive'],
['08-10-2017', '25', 'Sad', '0', '5', 'Inactive'],
['09-10-2017', '5461', 'Sad', '174', '4', 'Inactive'],
['10-10-2017', '6915', 'Neutral', '223', '5', 'Active']],
dtype='<U10')
What's the shape of the data?
data.shape
(96, 6)
There are 96 records and each record has 6 features. These features are:
• Date
• Step count
• Mood
• Calories Burned
• Hours of sleep
• activity status
Notice that the above array is homogeneous: all the data is stored as strings
In order to work with strings, categorical data and numerical data, we will have to save every
feature separately
How will we extract features into separate variables?
We can get some idea of how the data is saved.
Let's see what the first element of data is
data[0]
array(['06-10-2017', '5464', 'Neutral', '181', '5', 'Inactive'],
dtype='<U10')
Hm, this extracts a row, not a column
Think about it.
What's the way to change columns to rows and rows to columns?
Transpose
data.T[0]
array(['06-10-2017', '07-10-2017', '08-10-2017', '09-10-2017',
'10-10-2017', '11-10-2017', '12-10-2017', '13-10-2017',
'14-10-2017', '15-10-2017', '16-10-2017', '17-10-2017',
'18-10-2017', '19-10-2017', '20-10-2017', '21-10-2017',
'22-10-2017', '23-10-2017', '24-10-2017', '25-10-2017',
'26-10-2017', '27-10-2017', '28-10-2017', '29-10-2017',
'30-10-2017', '31-10-2017', '01-11-2017', '02-11-2017',
'03-11-2017', '04-11-2017', '05-11-2017', '06-11-2017',
'07-11-2017', '08-11-2017', '09-11-2017', '10-11-2017',
'11-11-2017', '12-11-2017', '13-11-2017', '14-11-2017',
'15-11-2017', '16-11-2017', '17-11-2017', '18-11-2017',
'19-11-2017', '20-11-2017', '21-11-2017', '22-11-2017',
'23-11-2017', '24-11-2017', '25-11-2017', '26-11-2017',
'27-11-2017', '28-11-2017', '29-11-2017', '30-11-2017',
'01-12-2017', '02-12-2017', '03-12-2017', '04-12-2017',
'05-12-2017', '06-12-2017', '07-12-2017', '08-12-2017',
'09-12-2017', '10-12-2017', '11-12-2017', '12-12-2017',
'13-12-2017', '14-12-2017', '15-12-2017', '16-12-2017',
'17-12-2017', '18-12-2017', '19-12-2017', '20-12-2017',
'21-12-2017', '22-12-2017', '23-12-2017', '24-12-2017',
'25-12-2017', '26-12-2017', '27-12-2017', '28-12-2017',
'29-12-2017', '30-12-2017', '31-12-2017', '01-01-2018',
'02-01-2018', '03-01-2018', '04-01-2018', '05-01-2018',
'06-01-2018', '07-01-2018', '08-01-2018', '09-01-2018'],
dtype='<U10')
Great, we could extract the first column
Let's extract all the columns and save them in separate variables
date, step_count, mood, calories_burned, hours_of_sleep, activity_status = data.T
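The load-then-transpose pattern can be tried without the dataset file. The sketch below feeds np.loadtxt() a small in-memory text (three made-up-length rows copied from the first records, with only three of the six columns) through io.StringIO, which is an assumption made purely so the example is self-contained.

```python
from io import StringIO
import numpy as np

# Three whitespace-separated records, reduced to date, step count and mood
text = (
    "06-10-2017 5464 Neutral\n"
    "07-10-2017 6041 Sad\n"
    "08-10-2017 25 Sad\n"
)

data = np.loadtxt(StringIO(text), dtype='str')
print(data.shape)            # (3, 3)

# data.T has one row per feature, so it unpacks into one variable per column
date, steps, mood = data.T
print(steps.tolist())        # ['5464', '6041', '25']
```

Unpacking works because iterating over a 2-D array yields its rows, and the rows of data.T are the columns of data.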
step_count
array(['5464', '6041', '25', '5461', '6915', '4545', '4340', '1230',
'61',
'1258', '3148', '4687', '4732', '3519', '1580', '2822', '181',
'3158', '4383', '3881', '4037', '202', '292', '330', '2209',
'4550', '4435', '4779', '1831', '2255', '539', '5464', '6041',
'4068', '4683', '4033', '6314', '614', '3149', '4005', '4880',
'4136', '705', '570', '269', '4275', '5999', '4421', '6930',
'5195', '546', '493', '995', '1163', '6676', '3608', '774',
'1421',
'4064', '2725', '5934', '1867', '3721', '2374', '2909', '1648',
'799', '7102', '3941', '7422', '437', '1231', '1696', '4921',
'221', '6500', '3575', '4061', '651', '753', '518', '5537',
'4108',
'5376', '3066', '177', '36', '299', '1447', '2599', '702',
'133',
'153', '500', '2127', '2203'], dtype='<U10')
step_count.dtype
dtype('<U10')
Notice the data type of step_count and other variables. It's a string type, where U means Unicode
string and 10 means a maximum length of 10 characters (not 10 bytes).
Why? Because we loaded everything with dtype='str', so NumPy stored all the data as strings.
Let's convert the data types of these variables
Step Count
step_count = np.array(step_count, dtype = 'int')
step_count.dtype
dtype('int64')
step_count
array([5464, 6041, 25, 5461, 6915, 4545, 4340, 1230, 61, 1258,
3148,
4687, 4732, 3519, 1580, 2822, 181, 3158, 4383, 3881, 4037,
202,
292, 330, 2209, 4550, 4435, 4779, 1831, 2255, 539, 5464,
6041,
4068, 4683, 4033, 6314, 614, 3149, 4005, 4880, 4136, 705,
570,
269, 4275, 5999, 4421, 6930, 5195, 546, 493, 995, 1163,
6676,
3608, 774, 1421, 4064, 2725, 5934, 1867, 3721, 2374, 2909,
1648,
799, 7102, 3941, 7422, 437, 1231, 1696, 4921, 221, 6500,
3575,
4061, 651, 753, 518, 5537, 4108, 5376, 3066, 177, 36,
299,
1447, 2599, 702, 133, 153, 500, 2127, 2203])
Calories Burned
calories_burned = np.array(calories_burned, dtype = 'int')
calories_burned.dtype
dtype('int64')
Hours of Sleep
hours_of_sleep = np.array(hours_of_sleep, dtype = 'int')
hours_of_sleep.dtype
dtype('int64')
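An equivalent way to do these conversions is the astype() method, sketched below on a three-element sample. The sample values are the first three step counts; the variable names are illustrative.

```python
import numpy as np

raw = np.array(['5464', '6041', '25'])   # strings, as loaded from the file
as_int = raw.astype(int)                 # returns a new integer array

print(raw.dtype.kind)      # U  (Unicode string)
print(as_int.dtype.kind)   # i  (signed integer)

# Arithmetic only makes sense after the conversion
total = as_int.sum()
print(total)               # 11530
```

Like np.array(step_count, dtype='int'), astype() leaves the original string array unchanged and returns a converted copy.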
Mood
Mood is a categorical data type. As the name suggests, a categorical variable takes one of a
fixed set of categories.
Let's check the values of the mood variable
mood
array(['Neutral', 'Sad', 'Sad', 'Sad', 'Neutral', 'Sad', 'Sad', 'Sad',
'Sad', 'Sad', 'Sad', 'Sad', 'Happy', 'Sad', 'Sad', 'Sad',
'Sad',
'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral',
'Neutral',
'Happy', 'Neutral', 'Happy', 'Happy', 'Happy', 'Happy',
'Happy',
'Happy', 'Happy', 'Neutral', 'Happy', 'Happy', 'Happy',
'Happy',
'Happy', 'Happy', 'Happy', 'Happy', 'Happy', 'Happy',
'Neutral',
'Happy', 'Happy', 'Happy', 'Happy', 'Happy', 'Happy', 'Happy',
'Happy', 'Happy', 'Neutral', 'Sad', 'Happy', 'Happy', 'Happy',
'Happy', 'Happy', 'Happy', 'Happy', 'Sad', 'Neutral',
'Neutral',
'Sad', 'Sad', 'Neutral', 'Neutral', 'Happy', 'Neutral',
'Neutral',
'Sad', 'Neutral', 'Sad', 'Neutral', 'Neutral', 'Sad', 'Sad',
'Sad',
'Sad', 'Happy', 'Neutral', 'Happy', 'Neutral', 'Sad', 'Sad',
'Sad',
'Neutral', 'Neutral', 'Sad', 'Sad', 'Happy', 'Neutral',
'Neutral',
'Happy'], dtype='<U10')
np.unique(mood)
array(['Happy', 'Neutral', 'Sad'], dtype='<U10')
Activity Status
activity_status
array(['Inactive', 'Inactive', 'Inactive', 'Inactive', 'Active',
'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Inactive', 'Inactive', 'Active', 'Inactive', 'Inactive',
'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Active', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Active',
'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Active', 'Active', 'Active', 'Active', 'Active', 'Active',
'Active', 'Active', 'Active', 'Inactive', 'Inactive',
'Inactive',
'Inactive', 'Inactive', 'Inactive', 'Active', 'Active',
'Active',
'Active', 'Active', 'Active', 'Active', 'Active', 'Active',
'Active', 'Active', 'Active', 'Inactive', 'Active', 'Active',
'Inactive', 'Active', 'Active', 'Active', 'Active', 'Active',
'Inactive', 'Active', 'Active', 'Active', 'Active', 'Inactive',
'Inactive', 'Inactive', 'Inactive', 'Active', 'Active',
'Active',
'Active', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Active',
'Inactive', 'Active'], dtype='<U10')
Let's try to get some insights from the data.
What's the average step count?
How can we calculate average? => .mean()
step_count.mean()
2935.9375
The user takes about 2,936 steps a day on average.
On which day was the step count highest?
How will we find it?
First we find the index of the maximum step count, then use that index to get the date.
How'll we find the index? =>
NumPy provides a function np.argmax() which returns the index of the maximum element.
Similarly, np.argmin() returns the index of the minimum element.
step_count.argmax()
69
Here 69 is the index of maximum step count element.
date[step_count.argmax()]
'14-12-2017'
Let's check the calories burned on that day
calories_burned[step_count.argmax()]
243
Not bad! 243 calories. Let's try to get the number of steps on that day as well
step_count.max()
7422
7k steps!! Sports mode on!
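The same index-lookup idea on a tiny sample, as a sketch. The arrays hold only the first five records of the dataset, so the answers differ from the full-data results above.

```python
import numpy as np

# First five records only (a small sample, not the full dataset)
dates = np.array(['06-10-2017', '07-10-2017', '08-10-2017',
                  '09-10-2017', '10-10-2017'])
steps = np.array([5464, 6041, 25, 5461, 6915])

best = steps.argmax()     # index of the largest step count
worst = steps.argmin()    # index of the smallest step count

print(best, dates[best], steps[best])      # 4 10-10-2017 6915
print(worst, dates[worst], steps[worst])   # 2 08-10-2017 25
```

Because all the feature arrays share the same row order, an index found in one array can be used to look up the matching value in any of the others.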
Let's try to compare step counts on bad mood days and good mood days
Average step count on Sad mood days
np.mean(step_count[mood == 'Sad'])
2103.0689655172414
np.sort(step_count[mood == 'Sad'])
array([ 25, 36, 61, 133, 177, 181, 221, 299, 518, 651,
702,
753, 799, 1230, 1258, 1580, 1648, 1696, 2822, 3148, 3519,
3721,
4061, 4340, 4545, 4687, 5461, 6041, 6676])
np.std(step_count[mood == 'Sad'])
2021.2355035376254
Average step count on happy days
np.mean(step_count[mood == 'Happy'])
3392.725
np.sort(step_count[mood == 'Happy'])
array([ 153, 269, 330, 493, 539, 546, 614, 705, 774, 995,
1421,
1831, 1867, 2203, 2255, 2725, 3149, 3608, 4005, 4033, 4064,
4068,
4136, 4275, 4421, 4435, 4550, 4683, 4732, 4779, 4880, 5195,
5376,
5464, 5537, 5934, 5999, 6314, 6930, 7422])
Average step count on sad days: about 2103
Average step count on happy days: about 3393
There may be a relation between mood and step count
Let's check the inverse: the mood when the step count was high or low
Mood when step count > 4000
np.unique(mood[step_count > 4000], return_counts = True)
(array(['Happy', 'Neutral', 'Sad'], dtype='<U10'), array([22, 9,
7]))
Out of 38 days when step count was more than 4000, user was feeling happy on 22 days.
Mood when step count < 2000
np.unique(mood[step_count < 2000], return_counts = True)
(array(['Happy', 'Neutral', 'Sad'], dtype='<U10'), array([13, 8,
18]))
Out of 39 days, when step count was less than 2000, user was feeling sad on 18 days.
There may be a correlation between Mood and step count
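Rather than writing one masked expression per mood, the per-category averages can be computed in a loop over np.unique(). This is a sketch on the first five records only; the dictionary name mean_by_mood is illustrative.

```python
import numpy as np

# First five records only (a small sample, not the full dataset)
steps = np.array([5464, 6041, 25, 5461, 6915])
mood = np.array(['Neutral', 'Sad', 'Sad', 'Sad', 'Neutral'])

mean_by_mood = {}
for m in np.unique(mood):
    # mood == m is a boolean mask selecting the days with that mood
    mean_by_mood[str(m)] = float(steps[mood == m].mean())

print(mean_by_mood['Neutral'])   # 6189.5
print(mean_by_mood['Sad'])       # roughly 3842.33
```

With so few days per category and a large spread in step counts, a difference in means is only a hint of a relationship, not evidence of one.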
import numpy as np
Reshape in 2D array
We saw reshape and flatten. What if I want to convert a matrix to a 1D array using
reshape()?
Question: What should I pass to A.reshape() if I want to use it to convert A to a 1D
vector?
• (1, 1)? - NO
• That shape holds only a single element
• But A has 12 elements
A = np.arange(12).reshape(3,4)
A.reshape(1, 1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-223-902e5c35e0d3> in <module>
----> 1 A.reshape(1, 1)
ValueError: cannot reshape array of size 12 into shape (1,1)
• So, (1, 12)? - NO
• It will still remain a 2D Matrix with dimensions 1 ×12
A.reshape(1, 12)
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
• Correct answer is (12), which is the same as passing 12
• We need a vector of shape (12,)
• So we need to pass only 1 dimension to reshape()
A.reshape(12)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
So, Be careful while using reshape() to convert a Matrix into a 1D vector
What will happen if we pass a negative integer in reshape()?
A.reshape(6, -1)
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11]])
Surprisingly, it did not give an error
• NumPy is able to figure out on its own what the value should be in place of the negative
integer
• Since the no. of elements in our matrix is 12
• And we passed 6 as the no. of rows
• It is able to figure out that the no. of columns should be 2
Same thing happens with this:
A.reshape(-1, 6)
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11]])
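A few more cases of the inferred dimension, as a sketch. It uses -1, the value documented for this purpose, and also shows the two situations in which NumPy cannot infer the size.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)

print(A.reshape(-1).shape)      # (12,)   flatten to 1D
print(A.reshape(2, -1).shape)   # (2, 6)
print(A.reshape(-1, 1).shape)   # (12, 1) a single column

# 12 is not divisible by 5, so no column count fits
try:
    A.reshape(5, -1)
    not_divisible_failed = False
except ValueError:
    not_divisible_failed = True

# Only one dimension can be left for NumPy to infer
try:
    A.reshape(-1, -1)
    two_unknowns_failed = False
except ValueError:
    two_unknowns_failed = True

print(not_divisible_failed, two_unknowns_failed)   # True True
```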
Matrix multiplication
Question: What will be output of following?
a = np.arange(5)
b = np.ones(5) * 2
a * b
array([0., 2., 4., 6., 8.])
Recall that, if a and b are 1D, * operation will perform elementwise multiplication
Lets try * with 2D arrays
A = np.arange(12).reshape(3, 4)
A
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
B = np.arange(12).reshape(3, 4)
B
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
A * B
array([[ 0, 1, 4, 9],
[ 16, 25, 36, 49],
[ 64, 81, 100, 121]])
Again, it did element-wise multiplication
For actual matrix multiplication, we have a different method/operator
np.matmul()
What is the requirement on the dimensions of 2 matrices for matrix multiplication?
• Columns of A = Rows of B (a must condition for matrix multiplication)
• If A is 3 × 4, B can be 4 × 3, or 4 × (something else)
So, let's reshape B to 4 × 3 instead
B = B.reshape(4, 3)
B
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
np.matmul(A, B)
array([[ 42, 48, 54],
[114, 136, 158],
[186, 224, 262]])
• We are getting a 3 ×3 matrix as output
• So, this is doing Matrix Multiplication
There's a direct operator as well for Matrix Multiplication
@
A @ B
array([[ 42, 48, 54],
[114, 136, 158],
[186, 224, 262]])
Question: What will be the dimensions of Matrix Multiplication B @ A?
• 4×4
B @ A
array([[ 20, 23, 26, 29],
[ 56, 68, 80, 92],
[ 92, 113, 134, 155],
[128, 158, 188, 218]])
There is another method in np for doing Matrix Multiplication
np.dot(A, B)
array([[ 42, 48, 54],
[114, 136, 158],
[186, 224, 262]])
Other cases of np.dot()
• It performs the dot product (inner product) when both inputs are 1D arrays
• It performs ordinary multiplication when both inputs are scalars.
a= np.array([1,2,3])
b = np.array([1,1,1])
np.dot(a,b) # 1*1 + 2*1 + 3*1 = 6
np.dot(4,5)
20
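The three spellings of matrix multiplication can be checked against each other with a short sketch, using the same 3 × 4 and 4 × 3 matrices as above.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)
B = np.arange(12).reshape(4, 3)

P = A @ B
print(P.shape)                             # (3, 3): rows of A x columns of B
print(np.array_equal(P, np.matmul(A, B)))  # True
print(np.array_equal(P, np.dot(A, B)))     # True

# Order matters: B @ A is a different, 4 x 4 matrix
Q = B @ A
print(Q.shape)                             # (4, 4)

# 1D inputs to np.dot give a scalar inner product
inner = np.dot(np.array([1, 2, 3]), np.array([1, 1, 1]))
print(inner)                               # 6
```

For 2-D inputs the three forms agree; they differ only in how they treat scalars and arrays with more than two dimensions.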
Now, Let's try multiplication of a mix of matrices and vectors
A = np.arange(12).reshape(3, 4) # A is a 3x4 Matrix
A
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
a = np.array([1, 2, 3]) # a has shape (3,); it can be thought of as a row vector
print(a.shape)
(3,)
np.matmul(A, a)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-243-76efef6bd8e9> in <module>
----> 1 np.matmul(A, a)
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 3 is different from 4)
Columns of A (4) ≠ Rows of a (3)
Let's try the reverse
np.matmul(a, A)
array([32, 38, 44, 50])
YES, Columns of a (3) = Rows of A (3)
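To round this off, here is a sketch of the shapes that do work on the right-hand side of A. The length-4 vector v is a hypothetical example that does not appear in the notes.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)

a = np.array([1, 2, 3])       # length 3 matches the 3 rows of A
left = a @ A                  # treated as a (1, 3) row vector
print(left.tolist())          # [32, 38, 44, 50]

v = np.array([1, 2, 3, 4])    # hypothetical vector; length 4 matches the 4 columns of A
right = A @ v                 # treated as a (4, 1) column vector
print(right.tolist())         # [20, 60, 100]

# An explicit column vector keeps the result 2-D
col = v.reshape(4, 1)
print((A @ col).shape)        # (3, 1)
```

With a 1-D operand the extra dimension is added only for the multiplication and then removed, which is why left and right come back as 1-D arrays.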
Vectorization
• We have already seen vectorization some time ago
Remember doing scalar operations on np arrays?
A * 2
That's vectorization
• Replacing explicit loops with array expressions is commonly referred to as
vectorization.
• Vectorization lets us perform operations directly on whole arrays instead of on one
scalar at a time.
• The operation gets performed on each element of the np array
Revisiting the example:
A = np.arange(10)
A * 2
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
np.vectorize()
• np.vectorize() turns an ordinary Python function into a vectorized function
• The resulting function takes numpy arrays as inputs and returns a single numpy array or a tuple of
numpy arrays.
• The vectorized function evaluates the original function element by element over the input arrays, like
the python map function
Let's plot the graph of y = log(x) (log function) using np.vectorize()
• math.log() works on a single number only, so it cannot take an array directly
• np.vectorize(math.log) returns a vectorized form of math.log() that accepts a vector/array/list as input
import math
import matplotlib.pyplot as plt
x = np.arange(1, 101)
y = np.vectorize(math.log)(x)
plt.plot(x, y)
plt.show()
y
array([0. , 0.69314718, 1.09861229, 1.38629436, 1.60943791,
1.79175947, 1.94591015, 2.07944154, 2.19722458, 2.30258509,
2.39789527, 2.48490665, 2.56494936, 2.63905733, 2.7080502 ,
2.77258872, 2.83321334, 2.89037176, 2.94443898, 2.99573227,
3.04452244, 3.09104245, 3.13549422, 3.17805383, 3.21887582,
3.25809654, 3.29583687, 3.33220451, 3.36729583, 3.40119738,
3.4339872 , 3.4657359 , 3.49650756, 3.52636052, 3.55534806,
3.58351894, 3.61091791, 3.63758616, 3.66356165, 3.68887945,
3.71357207, 3.73766962, 3.76120012, 3.78418963, 3.80666249,
3.8286414 , 3.8501476 , 3.87120101, 3.8918203 , 3.91202301,
3.93182563, 3.95124372, 3.97029191, 3.98898405, 4.00733319,
4.02535169, 4.04305127, 4.06044301, 4.07753744, 4.09434456,
4.11087386, 4.12713439, 4.14313473, 4.15888308, 4.17438727,
4.18965474, 4.20469262, 4.21950771, 4.2341065 , 4.24849524,
4.26267988, 4.27666612, 4.29045944, 4.30406509, 4.31748811,
4.33073334, 4.34380542, 4.35670883, 4.36944785, 4.38202663,
4.39444915, 4.40671925, 4.41884061, 4.4308168 , 4.44265126,
4.4543473 , 4.46590812, 4.47733681, 4.48863637, 4.49980967,
4.51085951, 4.52178858, 4.53259949, 4.54329478, 4.55387689,
4.56434819, 4.57471098, 4.58496748, 4.59511985, 4.60517019])
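For comparison, NumPy already ships a log ufunc, np.log, which gives the same values without wrapping math.log. The sketch below checks that the two agree; the plotting step is left out so that it runs without matplotlib.

```python
import math
import numpy as np

x = np.arange(1, 101)

y_vec = np.vectorize(math.log)(x)   # calls math.log once per element
y_np = np.log(x)                    # built-in ufunc, works on the whole array

print(np.allclose(y_vec, y_np))     # True
print(y_vec.shape, y_np.shape)      # (100,) (100,)
```

np.vectorize is provided mainly for convenience; internally it still loops over the elements in Python, so when a built-in ufunc such as np.log exists it is usually the faster choice.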
3 Dimensional Arrays
Vectors, Matrix and Tensors
1. Vector ---> 1-Dimensional Array
2. Matrix ---> 2-Dimensional Array
3. Tensor ---> 3 and above Dimensional Array
Tensor is a general term we use
• Tensor can also be less than 3D
• 2D Tensor is called a Matrix
• 1D Tensor is called a Vector
B = np.arange(24).reshape(2, 3, 4)
B
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
Now, What is happening here?
Question: How many dimensions B has?
• 3
• It's a 3-dimensional tensor
How is reshape(2, 3, 4) working?
• If you see, it is giving 2 matrices
• Each matrix has 3 rows and 4 columns
So, that's how reshape() is interpreted for 3D
• 1st argument gives depth (no. of matrices)
• 2nd argument gives no. of rows in each matrix
• 3rd argument gives no. of columns in each matrix
How can I get just the whole of 1st Matrix?
B[0]
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
Question: What value will I get if I do B[0, 0, 0]?
B[0, 0, 0]
0
Question: What value will I get if I do B[1, 1, 1]?
B[1, 1, 1]
# It looks at Matrix 1, that is, 2nd Matrix (Not Matrix 0)
# Then it looks at row 1 of matrix 1
# Then it looks at column 1 of row 1 of matrix 1
17
We can also do slicing in 3 dimensions
• It works the same way as in 2-D matrices
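A few slicing patterns on the same 2 × 3 × 4 array, as a sketch. Each index position corresponds to one axis: matrix, row, column.

```python
import numpy as np

B = np.arange(24).reshape(2, 3, 4)

first_rows = B[:, 0, :]     # row 0 of every matrix
print(first_rows.tolist())  # [[0, 1, 2, 3], [12, 13, 14, 15]]

block = B[1, :, 1:3]        # matrix 1, all rows, columns 1 and 2
print(block.tolist())       # [[13, 14], [17, 18], [21, 22]]

first_cols = B[:, :, 0]     # column 0 of every matrix
print(first_cols.tolist())  # [[0, 4, 8], [12, 16, 20]]
```

An integer index removes its axis from the result, while a slice keeps it, which is why first_rows is 2-D and not 3-D.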
Use Case: Image Manipulation using Numpy
• By now, you already have an idea that Numpy is an open-source Python
library for data manipulation and scientific computing.
• It is used for linear algebra, Fourier transforms, matrix operations, and across the
data science field.
• For numerical work, operations on NumPy arrays are typically much faster than the same
operations on Python lists.
Do you know Numpy can also be used for Image Processing?
• The fundamental idea is that an image can be represented as a Numpy ndarray of pixel values.
• So we can manipulate these arrays and play with images.
• This use case is to give you a broad overview of Numpy for Image Processing.
Make sure the required libraries are imported
import numpy as np
import matplotlib.pyplot as plt
Now, we'll see how we can play with images using Numpy
Opening an Image
• Well, to play with an image, we first need to open it
But, How can we open an image in our code?
• To open an image, we will use the matplotlib library to read and show images.
• We will cover all the functionalities of matplotlib in detail in visualization lecture.
• For this use case, just know that it uses an image module for working with images.
• It offers two useful methods imread() and imshow().
imread() – to read the images
imshow() – to display the images
Now, Let's go ahead and load our image
Drive link for the image:
Download the image fruits.jpg from here:
https://drive.google.com/file/d/1lHPQUi3wdB6HxN-SNJSBQXK7Z0y0wf32/view?usp=sharing
and place it in your current working directory
Let's download the images first
#fruits image
!gdown 17tYTDPBU5hpby9t0kGd7w_-zBsbY7sEd
Downloading...
From: https://drive.google.com/uc?id=17tYTDPBU5hpby9t0kGd7w_-zBsbY7sEd
To: /content/fruits.png
#emma stone image
!gdown 1o-8yqdTM7cfz_mAaNCi2nH0urFu7pcqI
Downloading...
From: https://drive.google.com/uc?id=1o-8yqdTM7cfz_mAaNCi2nH0urFu7pcqI
To: /content/emma_stone.jpeg
img = np.array(plt.imread('fruits.png'))
plt.imshow(img)
<matplotlib.image.AxesImage at 0x7fcc5402a1c0>
Details of an Image
What do you think are the dimensions and shape of this image?
We can check the number of dimensions and the shape of this image using the img.ndim and
img.shape attributes.
print('# of dims: ',img.ndim) # dimension of an image
print('Img shape: ',img.shape) # shape of an image
# of dims: 3
Img shape: (1333, 2000, 3)
How come our 2-D image has 3 dimensions?
• Coloured images have a 3rd dimension for depth or RGB colour channel
• Here, the depth is 3
• But we will come to what RGB colour channels are in a bit
First, let's understand something peculiar happening here with the shape of the image
Do you see something different when we check the shape of the image?
• When we discussed 3-D arrays, we saw that depth was the first element of the
shape tuple
• But when we load an image using matplotlib and get its 3-D array, we
see that depth is the last element of the shape tuple
Why does the depth appear in a different position here?
• This is simply the layout matplotlib uses when it reads an image: (rows, columns, channels)
• The depth values (R, G and B values) of each pixel are stored together, one after the other
The shape of the image we read is (1333, 2000, 3)
• The image has 1333 rows and 2000 columns of pixels
• Each pixel holds 3 depth values (R, G and B), so img[:, :, 0], img[:, :, 1] and
img[:, :, 2] are the 3 colour planes
• That is why depth is the last element of the shape tuple in the np array generated from
an image read by matplotlib
• In our earlier reshape(2, 3, 4) example, we treated the first element of the shape tuple
as depth. NumPy itself attaches no meaning to an axis; which axis we call depth is a convention
Now, What are these RGB channels and How can we visualize them?
Visualizing RGB Channels
We can split the image into its RGB colour channels using only Numpy
But, what exactly are RGB values?
• These are the values of each pixel of an image
• Each pixel is made up of 3 components/channels - Red, Green, Blue - which form
RGB values
• Coloured images are usually stored as 3-dimensional arrays of 8-bit unsigned
integers
• So, the range of values that each channel of a pixel can take is 0 to 2^8 − 1
• That is, each of a pixel's channels, R, G and B, can range from 0 to 255
• Note that matplotlib's imread() returns PNG images as float arrays scaled to the range 0 to 1 instead
Each pixel has these 3 values which combine to form the colour that the
pixel represents
• So, a pixel [255, 0, 0] will be RED in colour
• A pixel [0, 255, 0] will be GREEN in colour
• A pixel [0, 0, 255] will be BLUE in colour
Question: What will be the colour of pixel [0, 0, 0]?
• Black
Question: What will be the colour of pixel [255, 255, 255]?
• White
Now, Let's separate the R, G, B channels in our image:
• We'll make use of slicing of arrays
• For RED channel, we'll set values of GREEN and BLUE to 0
img = np.array(plt.imread('fruits.png'))
img_R = img.copy()
img_R[:, :, (1, 2)] = 0
plt.imshow(img_R)
<matplotlib.image.AxesImage at 0x7fcc3df86670>
Similarly, for GREEN channel, we'll set values of RED and BLUE to 0
... and same for BLUE channel
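The GREEN and BLUE versions follow the same pattern as the RED one. The sketch below uses a tiny synthetic 2 × 2 image in place of fruits.png (an assumption, so that it runs without the file); with the real image you would pass each result to plt.imshow().

```python
import numpy as np

# Synthetic stand-in for the image: 2 x 2 pixels, 3 channels, 8-bit values
img = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)

img_G = img.copy()
img_G[:, :, [0, 2]] = 0     # zero out RED and BLUE, keep GREEN

img_B = img.copy()
img_B[:, :, [0, 1]] = 0     # zero out RED and GREEN, keep BLUE

print(img_G[0, 0].tolist())   # [0, 1, 0]
print(img_B[0, 0].tolist())   # [0, 0, 2]

# A single channel on its own is a plain 2-D array
red_plane = img[:, :, 0]
print(red_plane.shape)        # (2, 2)
```

Working on a copy matters here: assigning into img directly would destroy the other two channels of the original image.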
Rotating an Image (Transpose the Numpy Array)
Now, what if we want to rotate the image?
• Remember, the image is a Numpy array
• We can turn the image on its side by transposing the array, that is, by swapping its row
and column axes
• Strictly speaking, a transpose is a 90° rotation combined with a mirror flip; np.rot90()
gives a pure rotation
For this, we'll use the np.transpose() function in numpy
Now, let's understand the np.transpose() function first
• It takes 2 arguments
1st argument is the array that we want to transpose (the image array in our case)
2nd argument is axes
• It's a tuple or list of ints
• It contains a permutation of [0,1,..,N-1] where N is the number of axes of the array
Now, our image array has 3 axes (3 dimensions) ---> 0th, 1st and 2nd
• We specify how we want to transpose the array by giving an order of these axes
inside the tuple
– Vertical axis (Row axis) is 0th axis
– Horizontal axis (Column axis) is 1st axis
– Depth axis is 2nd axis
• To turn the image, we want to transpose the array
• That is, we want to turn rows into columns and columns into rows
• So, we want to interchange the order of the row and column axes ---> interchange the
order of the 0th and 1st axes
• We don't want to change the depth axis (2nd axis) ---> So, it will remain at its
original position
Now, the order of axes in the original image is (0, 1, 2)
What will be the order of axes in the transposed array?
• The order of axes in the transposed image will be (1, 0, 2)
• The positions of the 0th and 1st axes are interchanged
Let's see it in action:
img = np.array(plt.imread('emma_stone.jpeg'))
img_rotated = np.transpose(img, (1,0,2))
plt.imshow(img_rotated)
<matplotlib.image.AxesImage at 0x7fcc3de81d00>
As you can see:
• We turned the image on its side by transposing the np array
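A transpose and a true 90° rotation give arrays of the same shape but with different contents. The sketch below compares them on a tiny 2 × 3 array; np.rot90 is a NumPy function the notes do not use, and the all-zero 3-D array is a synthetic stand-in for an image.

```python
import numpy as np

small = np.arange(6).reshape(2, 3)     # [[0, 1, 2], [3, 4, 5]]

print(small.T.tolist())                # [[0, 3], [1, 4], [2, 5]]   mirrored
print(np.rot90(small).tolist())        # [[2, 5], [1, 4], [0, 3]]   counter-clockwise
print(np.rot90(small, k=-1).tolist())  # [[3, 0], [4, 1], [5, 2]]   clockwise

# On an image-shaped array both operations swap height and width
# and leave the channel axis in place
img = np.zeros((2, 3, 3), dtype=np.uint8)
print(np.transpose(img, (1, 0, 2)).shape)   # (3, 2, 3)
print(np.rot90(img).shape)                  # (3, 2, 3)
```

If the mirrored look of the transposed image is not what you want, np.rot90(img) or np.rot90(img, k=-1) is the likely alternative.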
Trim Image
Now, how can we crop an image using Numpy?
• Remember! An image is a numpy array of pixels
• So, we can trim/crop an image in Numpy using array slicing.
Let's first see the original image
img = np.array(plt.imread('./emma_stone.jpeg'))
plt.imshow(img)
<matplotlib.image.AxesImage at 0x7fcc3de13190>
Now, Let's crop the image to get the face only
• If you see x and y axis, the face starts somewhat from ~200 and ends at ~700 on x-axis
– x-axis in image is column axis in np array
– Columns change along x-axis
• And it lies between ~100 to ~500 on y-axis
– y-axis in image is row axis in np array
– Rows change along y-axis
We'll use this information to slice our image array
img_crop = img[100:500, 200:700, :]
plt.imshow(img_crop)
<matplotlib.image.AxesImage at 0x7fcc3ddf9a60>
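The crop can be sanity-checked on a synthetic array. The 800 × 1000 size below is hypothetical and only stands in for the real photo; the slice bounds are the ones used above.

```python
import numpy as np

# Hypothetical image size, used only so the sketch runs without the file
img = np.zeros((800, 1000, 3), dtype=np.uint8)

img_crop = img[100:500, 200:700, :]    # rows (y) first, then columns (x)
print(img_crop.shape)                  # (400, 500, 3)

# Basic slicing returns a view, so the crop shares memory with the original
print(np.shares_memory(img_crop, img))       # True

# Take a copy if the crop will be modified independently
independent = img_crop.copy()
print(np.shares_memory(independent, img))    # False
```

The order of the slices is the usual stumbling block: the first slice selects the y range and the second the x range.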
Saving ndarray as Image
Now, how can we save an ndarray as an image?
To save an ndarray as an image, we can use matplotlib's plt.imsave() function.
• 1st argument ---> We provide the path and name of the file we want to save the image
as
• 2nd argument ---> We provide the image array we want to save
Let's save the cropped face image we obtained previously
path = 'emma_face.jpg'
plt.imsave(path, img_crop)
Now, if you go and check your current working directory, the image will have been
saved under the name emma_face.jpg
Array Splitting and Merging
• In addition to reshaping and selecting subarrays, it is often necessary to split arrays
into smaller arrays or merge arrays into bigger arrays.
• For example, when joining separately computed or measured data series into a
higher-dimensional array, such as a matrix.
Splitting
np.split()
• Splits an array into multiple sub-arrays as views
It takes an argument indices_or_sections
• If indices_or_sections is an integer, n, the array will be divided into n equal
arrays along axis.
• If such a split is not possible, an error is raised.
• If indices_or_sections is a 1-D array of sorted integers, the entries indicate
where along axis the array is split.
• If an index exceeds the dimension of the array along axis, an empty sub-array is
returned correspondingly.
x = np.arange(9)
x
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
np.split(x, 3)
[array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])]
np.split(x, [3, 5, 6])
[array([0, 1, 2]), array([3, 4]), array([5]), array([6, 7, 8])]
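When the array cannot be divided equally, np.split() raises an error, as stated above. A sketch of that case follows, together with np.array_split(), a related NumPy function not covered in the notes that allows unequal parts.

```python
import numpy as np

x = np.arange(9)

# 9 elements cannot be divided into 4 equal parts
try:
    np.split(x, 4)
    split_failed = False
except ValueError:
    split_failed = True
print(split_failed)                     # True

# np.array_split makes the leading parts one element longer instead
parts = np.array_split(x, 4)
print([p.tolist() for p in parts])      # [[0, 1, 2], [3, 4], [5, 6], [7, 8]]
```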
np.hsplit()
• Splits an array into multiple sub-arrays horizontally (column-wise).
x = np.arange(16.0).reshape(4, 4)
x
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.],
[12., 13., 14., 15.]])
Think of it this way:
• There are 2 axes in a 2-D array
a. 1st axis - Vertical axis
b. 2nd axis - Horizontal axis
Along which axis are we splitting the array?
• The split we want happens across the 2nd axis (Horizontal axis)
• That is why we use hsplit()
So, try to think in terms of "whether the operation is happening along vertical axis or
horizontal axis"
• We are splitting the horizontal axis in this case
np.hsplit(x, 2)
[array([[ 0., 1.],
[ 4., 5.],
[ 8., 9.],
[12., 13.]]), array([[ 2., 3.],
[ 6., 7.],
[10., 11.],
[14., 15.]])]
np.hsplit(x, np.array([3, 6]))
[array([[ 0., 1., 2.],
[ 4., 5., 6.],
[ 8., 9., 10.],
[12., 13., 14.]]), array([[ 3.],
[ 7.],
[11.],
[15.]]), array([], shape=(4, 0), dtype=float64)]
np.vsplit()
• Splits an array into multiple sub-arrays vertically (row-wise).
x = np.arange(16.0).reshape(4, 4)
x
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.],
[12., 13., 14., 15.]])
Now, along which axis are we splitting the array?
• The split we want happens across the 1st axis (Vertical axis)
• That is why we use vsplit()
Again, always try to think in terms of "whether the operation is happening along
vertical axis or horizontal axis"
• We are splitting the vertical axis in this case
np.vsplit(x, 2)
[array([[0., 1., 2., 3.],
[4., 5., 6., 7.]]), array([[ 8., 9., 10., 11.],
[12., 13., 14., 15.]])]
np.vsplit(x, np.array([3]))
[array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]]), array([[12., 13., 14., 15.]])]
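To connect hsplit() and vsplit() with the axis idea, here is a short sketch showing that they give the same pieces as np.split() with axis=1 and axis=0 respectively. The variable names are made up for the example.

```python
import numpy as np

x = np.arange(16.0).reshape(4, 4)

# hsplit cuts along the horizontal axis, which is axis=1.
left, right = np.hsplit(x, 2)
left_alt, right_alt = np.split(x, 2, axis=1)

# vsplit cuts along the vertical axis, which is axis=0.
top, bottom = np.vsplit(x, 2)
top_alt, bottom_alt = np.split(x, 2, axis=0)

print(np.array_equal(left, left_alt), np.array_equal(right, right_alt))
print(np.array_equal(top, top_alt), np.array_equal(bottom, bottom_alt))
```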
Stacking
Let's say we have an array and we want to stack it like this:
Will we use vstack() or hstack()?
Along which axis is the operation happening?
• Vertical axis
• So, we'll use vstack()
np.vstack()
• Stacks a list of arrays vertically (along axis 0 or 1st axis)
• For example, given a list of row vectors, appends the rows to form a matrix.
data = np.arange(5)
data
array([0, 1, 2, 3, 4])
np.vstack((data, data, data))
array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])
Now, What if we want to stack the array like this?
• Operation or change is happening along horizontal axis
• So, we'll use hstack()
np.hstack()
• Stacks a list of arrays horizontally (along axis 1)
• For example, given a list of column vectors, appends the columns to form a
matrix.
data = np.arange(5).reshape(5,1)
data
array([[0],
[1],
[2],
[3],
[4]])
np.hstack((data, data, data))
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2],
[3, 3, 3],
[4, 4, 4]])
Question: Now, What will be the output of this?
a = np.array([[1], [2], [3]])
b = np.array([[4], [5], [6]])
np.hstack((a, b))
a = np.array([[1], [2], [3]])
a
array([[1],
[2],
[3]])
b = np.array([[4], [5], [6]])
b
array([[4],
[5],
[6]])
np.hstack((a, b))
array([[1, 4],
[2, 5],
[3, 6]])
This time both a and b are column vectors
• So, the stacking of a and b along horizontal axis is more clearly visible
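One detail worth trying yourself: for 1-D inputs, hstack() simply joins the arrays end to end, while vstack() treats each one as a row. A small sketch (variable names are illustrative only):

```python
import numpy as np

data = np.arange(5)                      # 1-D array of shape (5,)

joined = np.hstack((data, data, data))   # 1-D inputs are joined end to end
rows = np.vstack((data, data, data))     # each 1-D input becomes one row

column = data.reshape(5, 1)              # column vector of shape (5, 1)
side_by_side = np.hstack((column, column, column))

print(joined.shape)
print(rows.shape)
print(side_by_side.shape)
```

This is why the hstack() example uses column vectors: only then do the arrays visibly sit next to each other.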
Now, Let's look at a more generalized way of stacking arrays
np.concatenate()
• Creates a new array by appending arrays after each other, along a given axis
• Provides similar functionality, but it takes a keyword argument axis that specifies
the axis along which the arrays are to be concatenated.
All input arrays to concatenate() need to have the same number of dimensions, and their
shapes must match everywhere except along the axis of concatenation
z = np.array([[2, 4]])
z
array([[2, 4]])
z.ndim
2
zz = np.concatenate([z, z], axis=0)
zz
array([[2, 4],
[2, 4]])
zz = np.concatenate([z, z], axis=1)
zz
array([[2, 4, 2, 4]])
Let's look at a few more examples using np.concatenate()
Question: What will be the output of this?
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
np.concatenate((a, b), axis=0)
a = np.array([[1, 2], [3, 4]])
a
array([[1, 2],
[3, 4]])
b = np.array([[5, 6]])
b
array([[5, 6]])
np.concatenate((a, b), axis=0)
array([[1, 2],
[3, 4],
[5, 6]])
Now, How did it work?
• Dimensions of a is 2 ×2
What is the dimensions of b ?
• 1-D array ?? - NO
• Look carefully!!
• b is a 2-D array of dimensions 1 ×2
axis = 0 ---> It's a vertical axis
• So, changes will happen along vertical axis
• So, b gets concatenated below a
Now, what if we pass axis=None instead of an axis number?
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
np.concatenate((a, b), axis=None)
array([1, 2, 3, 4, 5, 6])
Can you see what happened here?
• When we pass axis=None, np.concatenate() flattens the arrays and concatenates
them into a 1-D array
• If the axis argument is left out completely, the default is axis=0
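The shape requirements of np.concatenate() can be checked with a short sketch. The 1-D array b_1d is a made-up example added here to trigger the mismatch; it is not part of the lecture code.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])   # shape (2, 2)
b = np.array([[5, 6]])           # shape (1, 2)

below = np.concatenate((a, b), axis=0)      # b goes below a
beside = np.concatenate((a, b.T), axis=1)   # b.T has shape (2, 1) and goes beside a

# A 1-D array has a different number of dimensions than a, so this fails.
b_1d = np.array([5, 6])
try:
    np.concatenate((a, b_1d), axis=0)
    mismatch_raised = False
except ValueError as exc:
    mismatch_raised = True
    print("ValueError:", exc)

print(below)
print(beside)
```

For 2-D inputs, axis=0 behaves like vstack() and axis=1 behaves like hstack().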
Broadcasting
Case1:
You are given two 2D arrays
[[0, 0, 0], [[0, 1, 2],
[10, 10, 10], and [0, 1, 2],
[20, 20, 20], [0, 1, 2],
[30, 30, 30]] [0, 1, 2]]
Shape of first array is 4x3
Shape of second array is 4x3.
Will addition of these arrays be possible? Yes, as the shapes of these two arrays match.
a = np.tile(np.arange(0,40,10), (3,1))
a
array([[ 0, 10, 20, 30],
[ 0, 10, 20, 30],
[ 0, 10, 20, 30]])
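np.tile function is used to repeat the given array multiple times. The second argument gives the
number of repetitions along each axis, e.g. (3,1) means 3 times along rows and 1 time along columns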
np.tile(np.arange(0,40,10), (3,2))
array([[ 0, 10, 20, 30, 0, 10, 20, 30],
[ 0, 10, 20, 30, 0, 10, 20, 30],
[ 0, 10, 20, 30, 0, 10, 20, 30]])
Now, let's get back to example:
array([[ 0, 10, 20, 30],
[ 0, 10, 20, 30],
[ 0, 10, 20, 30]])
a = a.T
array([[ 0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]])
b = np.tile(np.arange(0,3), (4,1))
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])
Let's add these two arrays:
a + b
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
Textbook case of element-wise addition of two 2D arrays.
Case2 :
Imagine an array like this:
[[0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]]
I want to add the following array to it:
[[0, 1, 2]]
Is it possible? Yes!
What broadcasting does is replicate the second array row-wise 4 times to fit the size of the first
array.
Here both arrays have the same number of columns
array([[ 0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]])
b = np.arange(0,3)
b
array([0, 1, 2])
a + b
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
The smaller array is broadcast across the larger array so that they have compatible shapes.
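To convince ourselves that broadcasting gives the same answer as replicating the row by hand, here is a minimal sketch comparing a + b with an explicit np.tile() version. The names explicit and broadcast are chosen only for this example.

```python
import numpy as np

a = np.tile(np.arange(0, 40, 10), (3, 1)).T   # shape (4, 3)
b = np.arange(0, 3)                           # shape (3,)

# Manual replication: repeat b as 4 rows, then add element-wise.
explicit = a + np.tile(b, (4, 1))

# Broadcasting: Numpy stretches b across the 4 rows for us.
broadcast = a + b

print(np.array_equal(explicit, broadcast))
```

The result is the same, but with broadcasting we never have to build the replicated (4, 3) copy of b ourselves.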
Case 3:
Imagine I have two arrays like this:
[[0],
[10],
[20],
[30]]
and
[[0, 1, 2]]
i.e. one column matrix and one row matrix.
When we try to add these arrays, broadcasting will replicate the first array column-wise 3 times and
the second array row-wise 4 times to match up the shapes.
a = np.arange(0,40,10)
a
array([ 0, 10, 20, 30])
This is a 1D array. But we want this array as a column. How do we do it? Reshape!
a = a.reshape(4,1)
a
array([[ 0],
[10],
[20],
[30]])
b = np.arange(0,3)
b
array([0, 1, 2])
a + b
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
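If you want to see what the two "stretched" arrays look like before they are added, np.broadcast_arrays() can show them. This is only a sketch for inspection; the names a_stretched and b_stretched are made up here.

```python
import numpy as np

a = np.arange(0, 40, 10).reshape(4, 1)   # column, shape (4, 1)
b = np.arange(0, 3)                      # row, shape (3,)

# Ask Numpy for both inputs as they look after broadcasting.
a_stretched, b_stretched = np.broadcast_arrays(a, b)

print(a_stretched)
print(b_stretched)
print(np.array_equal(a_stretched + b_stretched, a + b))
```

Both stretched arrays have shape (4, 3): the column is repeated 3 times and the row is repeated 4 times.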
Question: (for general broadcasting rules)
What will be the output of the following?
a = np.arange(8).reshape(2,4)
b = np.arange(16).reshape(4,4)
print(a + b)
a = np.arange(8).reshape(2,4)
a
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
b = np.arange(16).reshape(4,4)
b
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
a + b
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-306-bd58363a63fc> in <module>
----> 1 a + b
ValueError: operands could not be broadcast together with shapes (2,4) (4,4)
Why didn't it work?
To understand this, let's learn about some General Broadcasting Rules
For each dimension (going from the right side)
1. The sizes of the dimension should be the same OR
2. The size of one of them should be 1
Rule 1 : If two arrays differ in the number of dimensions, the shape of the one with fewer dimensions is
padded with ones on its leading (left) side.
Rule 2 : If the shapes of the two arrays do not match in some dimension, the array whose size is 1 in that
dimension is stretched to match the other shape.
Rule 3 : If in any dimension the sizes disagree and neither is equal to 1, then an error is raised.
In the above example, the shapes were (2,4) and (4,4).
Let's compare the dimension from right to left
• First, it will compare the rightmost dimensions (4 and 4), which are equal.
• Next, it will compare the dimensions to their left i.e. 2 and 4.
– Both conditions fail here. They are neither equal, nor is one of them 1.
Hence, it threw an error while broadcasting.
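The three rules can be written down as a small pure-Python function. This is a sketch for understanding only, not how Numpy implements broadcasting internally; broadcast_shape is a hypothetical helper name, and it ignores corner cases such as zero-length dimensions.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError (hypothetical helper)."""
    ndim = max(len(shape_a), len(shape_b))
    # Rule 1: pad the shorter shape with ones on the left.
    padded_a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    padded_b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for size_a, size_b in zip(padded_a, padded_b):
        if size_a == size_b or size_b == 1:
            result.append(size_a)      # sizes agree, or b is stretched (Rule 2)
        elif size_a == 1:
            result.append(size_b)      # a is stretched (Rule 2)
        else:
            # Rule 3: sizes disagree and neither is 1.
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(result)

print(broadcast_shape((3, 3), (3,)))
print(broadcast_shape((4, 1), (3,)))

try:
    broadcast_shape((2, 4), (4, 4))
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
    print("ValueError:", mismatch_error)
```

The last call fails for the same reason as the (2,4) and (4,4) example: 2 and 4 are neither equal, nor is one of them 1.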
Now, Let's take a look at few more examples
Question : Will broadcasting work in this case ?
A = np.arange(1,10).reshape(3,3)
B = np.array([-1, 0, 1])
A * B
A = np.arange(1,10).reshape(3,3)
A
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
B = np.array([-1, 0, 1])
B
array([-1, 0, 1])
A * B
array([[-1, 0, 3],
[-4, 0, 6],
[-7, 0, 9]])
Why did A * B work in this case?
• A has 3 rows and 3 columns i.e. (3,3)
• B is a 1-D vector with 3 elements (3,)
Now, if you look at rule 1
Rule 1 : If two arrays differ in the number of dimensions,
the shape of the one with fewer dimensions is padded with ones on its
leading (left) side.
What is the shape of A and B ?
• A has a shape of (3,3)
• B has a shape of (3,)
As per the rule 1,
• the shape of the array with fewer dimensions will be prefixed with ones on its leading side.
Here, the shape of B will be prefixed with 1
• So, its shape will become (1,3)
Can we multiply a (3,3) and a (1,3) array element-wise?
We check the validity of broadcasting. i.e. if broadcasting is possible or not.
Checking the dimension from right to left.
• It will compare the right most dimension (3); which are equal
• Now, it compares the leading dimension.
– The size of one dimension is 1.
Hence, broadcasting condition is satisfied
How will it broadcast?
As per rule 2:
Rule 2 :
If the shapes of the two arrays do not match in some dimension,
the array whose size is 1 in that dimension is stretched to match the other shape.
Here, array B (1,3) will replicate/stretch its row 3 times to match the shape of A
So, B gets broadcast over A for each row of A
Question: Will broadcasting work in following case ?
A = np.arange(1,10).reshape(3,3)
B = np.arange(3, 10, 3).reshape(3,1)
C = A + B
A = np.arange(1,10).reshape(3,3)
A
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
B = np.arange(3, 10, 3).reshape(3,1)
B
array([[3],
[6],
[9]])
How did this A + B work?
• A has 3 rows and 3 columns i.e. shape (3,3)
• B has 3 rows and 1 column -i.e. shape (3,1)
Do we need to check rule 1 ?
Since, both arrays have same number of dimensions, we can ignore Rule 1.
Let's check whether broadcasting is possible or not
Now, for each dimension from right to left
• Rightmost dimension is 3 for A and 1 for B, so one of them is 1.
• Leading dimensions are matching (3)
So, conditions for broadcasting are met.
How will broadcasting happen?
As per rule 2, the dimension with value 1 will be stretched.
• A.shape => (3,3)
• B.shape => (3,1)
Hence, the single column of B will be replicated/stretched 3 times to match the dimensions of A.
• So, B gets broadcast on every column of A
C = A + B
np.round(C, 1)
array([[ 4, 5, 6],
[10, 11, 12],
[16, 17, 18]])
Dimension Expansion and Reduction
Recall that we learnt how to convert 1D array to 2D array in previous lectures
import numpy as np
arr = np.arange(6)
arr
array([0, 1, 2, 3, 4, 5])
arr.shape
(6,)
arr = arr.reshape(1,-1)
arr.shape
(1, 6)
This is also known as expanding dimensions
i.e. we expanded our dimension from 1D to 2D
We can also perform the same operation using np.expand_dims()
np.expand_dims()
• Expands the shape of an array with axis of length 1.
• Insert a new axis that will appear at the axis position in the expanded array shape.
Function signature: np.expand_dims(arr, axis)
Documentation:
https://numpy.org/doc/stable/reference/generated/numpy.expand_dims.html#numpy.expand_dims
arr
array([[0, 1, 2, 3, 4, 5]])
Let's check the shape of arr
arr.shape
(1, 6)
Let's expand the dimensions
arr1 = np.expand_dims(arr, axis = 0 )
arr1
array([[[0, 1, 2, 3, 4, 5]]])
arr1.shape
(1, 1, 6)
What happened here?
Here, the shape of arr is (1, 6), because we reshaped it earlier
• It has two axes i.e. axis = 0 and axis = 1.
When we expand dimension with axis = 0,
• it inserts a new axis of length 1 @ axis = 0
• Shape becomes (1, 1, 6) from (1, 6)
• i.e. 1 is padded at the given axis location
Let's expand dims @ axis = 1
arr2 = np.expand_dims(arr, axis = 1)
arr2
array([[[0, 1, 2, 3, 4, 5]]])
arr2.shape
(1, 1, 6)
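Notice that,
• as we provided axis = 1 in argument,
• it inserted the new axis of length 1 @ axis = 1.
• Hence, shape becomes (1, 1, 6) from (1, 6). It looks the same as the axis = 0 result only because
the existing first axis of arr already has length 1
• For a 1D array of shape (6,), axis = 0 would give (1, 6) and axis = 1 would give (6, 1)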
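The effect of the axis argument is easiest to see by starting from a 1D array and then repeating the experiment on a 2D one. A minimal sketch (variable names are illustrative):

```python
import numpy as np

v = np.arange(6)                     # shape (6,)

row = np.expand_dims(v, axis=0)      # new axis in front
col = np.expand_dims(v, axis=1)      # new axis at the end

m = v.reshape(1, -1)                 # shape (1, 6)
front = np.expand_dims(m, axis=0)    # new axis at position 0
back = np.expand_dims(m, axis=2)     # new axis at position 2

print(row.shape, col.shape)
print(front.shape, back.shape)
```

In every case the data stays the same; only an extra axis of length 1 appears at the requested position.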
We can also do the same thing using np.newaxis
np.newaxis
• It is passed inside the indexing brackets of the array, at the position where the new axis should appear.
Let's see how it works
arr = np.arange(6)
arr[np.newaxis, :] #equivalent to np.expand_dims(arr, axis =0)
array([[0, 1, 2, 3, 4, 5]])
We basically passed np.newaxis at the axis position where we want to add an axis
• In arr[np.newaxis, : ],
– we passed it @ axis = 0, hence an axis of length 1 was added @ axis = 0
– and therefore, shape became (1, 6)
arr[:, np.newaxis] # equivalent to np.expand_dims(arr, axis = 1 )
array([[0],
[1],
[2],
[3],
[4],
[5]])
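A short sketch of np.newaxis in action. It also shows a common use: turning one 1D array into a row and a column so that broadcasting builds a full table. The name table is made up for the example.

```python
import numpy as np

arr = np.arange(6)

row = arr[np.newaxis, :]    # shape (1, 6)
col = arr[:, np.newaxis]    # shape (6, 1)

# np.newaxis is just a readable alias for None.
print(np.newaxis is None)

# Column + row broadcasts to a (6, 6) table where table[i, j] = i + j.
table = col + row
print(row.shape, col.shape, table.shape)
```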
What if we want to reduce the number of dimensions?
We can use np.squeeze for reducing the dimensions
np.squeeze()
• It removes the axis of length 1 from array.
• Inverse of expand_dims
Function signature: np.squeeze(arr, axis)
Documentation: https://numpy.org/doc/stable/reference/generated/numpy.squeeze.html
arr = np.arange(9).reshape(1,1,9)
arr
array([[[0, 1, 2, 3, 4, 5, 6, 7, 8]]])
arr.shape
(1, 1, 9)
arr1 = np.squeeze(arr)
arr1
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
arr1.shape
(9,)
Notice that
• it reduced the shape from (1,1,9) to (9,)
• it did so by removing the axis of length 1
• i.e. it removed axis 0 and 1.
We can also remove specific axis using the axis argument
arr
array([[[0, 1, 2, 3, 4, 5, 6, 7, 8]]])
arr.shape
(1, 1, 9)
Let's remove axis = 1
arr1 = np.squeeze(arr, axis = 1 )
arr1
array([[0, 1, 2, 3, 4, 5, 6, 7, 8]])
arr1.shape
(1, 9)
What if we try to remove axis = 2?
np.squeeze(arr, axis = 2 )
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-335-26d8de107e93> in <module>
----> 1 np.squeeze(arr, axis = 2 )
/usr/local/lib/python3.9/dist-packages/numpy/core/overrides.py in squeeze(*args, **kwargs)
/usr/local/lib/python3.9/dist-packages/numpy/core/fromnumeric.py in squeeze(a, axis)
   1543         return squeeze()
   1544     else:
-> 1545         return squeeze(axis=axis)
   1546
   1547
ValueError: cannot select an axis to squeeze out which has size not equal to one
It'll throw an error
• as we are trying to remove an axis whose length is not 1 (axis 2 has length 9)
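A sketch of a few more ways to use np.squeeze(): removing several axes at once, undoing expand_dims(), and guarding against the error above. safe_squeeze is a hypothetical helper written for this example, not a Numpy function.

```python
import numpy as np

arr = np.arange(9).reshape(1, 1, 9)

# A tuple of axes removes several length-1 axes in one call.
both = np.squeeze(arr, axis=(0, 1))

# squeeze undoes what expand_dims did.
v = np.arange(9)
round_trip = np.squeeze(np.expand_dims(v, axis=0), axis=0)

def safe_squeeze(a, axis):
    """Squeeze the axis only if its length is 1, else return the array unchanged."""
    if a.shape[axis] == 1:
        return np.squeeze(a, axis=axis)
    return a

print(both.shape)
print(np.array_equal(round_trip, v))
print(safe_squeeze(arr, 2).shape)   # axis 2 has length 9, so nothing is removed
print(safe_squeeze(arr, 0).shape)
```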
Views vs Copies (Shallow vs Deep Copy)
• Numpy manages memory very efficiently
• Which makes it really useful while dealing with large datasets
But how does it manage memory so efficiently?
• Let's create some arrays to understand what's happening in memory while using Numpy
# We'll create np array
a = np.arange(4)
a
array([0, 1, 2, 3])
# Reshape array `a` and store in b
b = a.reshape(2, 2)
b
array([[0, 1],
[2, 3]])
Now we will make some changes to our original array a
a[0] = 100
a
array([100, 1, 2, 3])
What will be values if we print array b ?
b
array([[100, 1],
[ 2, 3]])
Surprise Surprise!!
• Array b got automatically updated
This is an example of Numpy using "Shallow Copy" of data
Now, What happens here?
• Numpy re-uses data as much as possible instead of duplicating it
• This helps Numpy to be efficient
When we created b = a.reshape(2, 2)
• Numpy did NOT make a copy of a to store in b, as we can clearly see
• It is using the same data as in a
• It just looks different (reshaped) in b
• That is why any changes in a automatically get reflected in b
How data is stored using Numpy?
• Variable does NOT directly point to data stored in memory
• There is something called Header in-between
What does Header do?
• Variable points to header and header points to data stored in memory
• Header stores information about data - called Metadata
a is pointing to Metadata about our data [0, 1, 2, 3], which may include:
• How many values we have --> 4
• What is the Data Type of data --> int
• What's the Shape --> (4,)
• What's the stride i.e. step size --> 1 (counted in elements here; Numpy itself stores it in bytes)
When we do b = a.reshape(2, 2)
• Numpy does NOT duplicate the data pointed to by a
• It uses the same data
• And creates a New header for b that points to the same data as pointed to by a
b points to a new Header having different values of Metadata of the same data:
• Number of values --> 4
• Data Type --> int
• Shape --> (2, 2)
• Strides i.e. step sizes --> (2, 1) i.e. 2 elements to reach the next row, 1 element to reach the next column
That is why:
• When data is accessed using a, it gives data in shape (4,)
• And when data is accessed using b, it gives same data in shape (2, 2)
This helps Numpy to save time and space - Making it efficient
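The header information can be inspected directly. A minimal sketch: note that Numpy reports strides in bytes, so the numbers depend on the integer size of your platform (8 bytes per element is assumed to be typical, but the code below does not rely on it).

```python
import numpy as np

a = np.arange(4)
b = a.reshape(2, 2)

# Same data, two different headers.
print(a.shape, a.strides, a.dtype)
print(b.shape, b.strides, b.dtype)

# Strides divided by the element size give the step in elements.
steps_a = tuple(s // a.itemsize for s in a.strides)
steps_b = tuple(s // b.itemsize for s in b.strides)
print(steps_a, steps_b)

# b does not own its data; its base is the array a.
print(b.base is a)
```

The .base attribute is a quick way to see that b borrows its data from a.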
[Figure: Numpy metadata internals]
Now, let's see an example where Numpy will create a "Deep Copy" of data
Now, what if we do this?
a = np.arange(4)
a
array([0, 1, 2, 3])
# Create `c`
c = a + 2
c
array([2, 3, 4, 5])
# We make changes in a
a[0] = 100
a
array([100, 1, 2, 3])
c
array([2, 3, 4, 5])
As we can see, c did not get affected on changing a
• Because a + 2 is an arithmetic operation that computes new values
• The result cannot be described by a new header over the data of a
• So, Numpy had to allocate separate memory for c - i.e., c is independent of array a,
like a deep copy
Conclusion:
• Numpy is able to reuse the same data for operations that only change how the data is read,
like reshape ---> Shallow Copy
• It creates a new copy of the data where operations compute new values
---> Deep Copy
Be careful about this while writing code using Numpy
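One reason to be careful: reshape usually returns a view, but not always. The sketch below is an illustrative case (not from the lecture) where the reshaped result cannot be described by strides over the original data, so Numpy silently makes a copy.

```python
import numpy as np

a = np.arange(6)
b = a.reshape(2, 3)     # contiguous data: a view of a
t = b.T                 # transpose: still a view of a
flat = t.reshape(6)     # element order 0, 3, 1, 4, 2, 5 needs a fresh buffer: a copy

print(np.shares_memory(a, b))
print(np.shares_memory(a, t))
print(np.shares_memory(a, flat))
print(flat)
```

When it matters, check with np.shares_memory() instead of assuming.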
Is there a way to check whether two arrays are sharing memory or not? Yes, the
np.shares_memory() function comes to the rescue!!
a= np.arange(10)
a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
b = a[::2]
b
array([0, 2, 4, 6, 8])
np.shares_memory(a,b)
True
Notice that slicing creates shallow copies.
Why does slicing create shallow copies?
Remember the stride param of the header.
• Stride is nothing but the step size.
For Array a, we have a stride of 1.
For creating array b,
• we are slicing array a by 2 i.e. stride 2.
• So, it creates a new header for array b with stride = 2 while pointing to the original data
b[0] = 2
b
array([2, 2, 4, 6, 8])
a
array([2, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Notice how change in b also changed the value in array a
Let's check with deep copy
a
array([2, 1, 2, 3, 4, 5, 6, 7, 8, 9])
b = a +2
np.shares_memory(a,b)
False
We learnt that .reshape and slicing return a view of the original array
• i.e. Any changes made in the original array will be reflected in the new array, and vice versa.
However, we saw that creating a new array using
• masking or an array operation returns a deep copy of the array.
• Any changes made in the new array are not reflected in the original array.
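The difference between slicing and masking can be verified with np.shares_memory(). A small sketch; the index list [0, 1, 2] and the even-number mask are arbitrary choices for the example.

```python
import numpy as np

a = np.arange(10)

sliced = a[2:5]            # basic slicing: a view
masked = a[a % 2 == 0]     # boolean masking: a copy
fancy = a[[0, 1, 2]]       # indexing with a list of positions: a copy

print(np.shares_memory(a, sliced))
print(np.shares_memory(a, masked))
print(np.shares_memory(a, fancy))

# Changing the masked result leaves the original untouched.
masked[0] = 99
print(a[0])
```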
Numpy also provides us with a few functions to make shallow / deep copies
How to make a shallow copy?
Numpy provides us with the .view() method, which returns a view of an array
.view()
Returns view of the original array
• Any changes made in new array will be reflected in original array.
Function documentation:
https://numpy.org/doc/stable/reference/generated/numpy.ndarray.view.html
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
view_arr = arr.view()
view_arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
view_arr[4] = 420
view_arr
array([ 0, 1, 2, 3, 420, 5, 6, 7, 8, 9])
arr
array([ 0, 1, 2, 3, 420, 5, 6, 7, 8, 9])
Notice that changes in view array are reflected in original array.
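A view shares the data but has its own header, so it can have its own shape. A minimal sketch of that idea; the name grid is made up for the example.

```python
import numpy as np

arr = np.arange(10)
view_arr = arr.view()

# Reshaping the view gives another header over the same data.
grid = view_arr.reshape(2, 5)
grid[0, 4] = 420

print(arr)                       # the change shows up in the original
print(arr.shape, grid.shape)     # but the shapes stay independent
print(np.shares_memory(arr, grid))
```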
How do we make deep copy ?
Numpy has .copy() function for that purpose
.copy()
Returns copy of the array.
Documentation (.copy()):
https://numpy.org/doc/stable/reference/generated/numpy.ndarray.copy.html#numpy.ndarray.copy
Documentation: (np.copy()):
https://numpy.org/doc/stable/reference/generated/numpy.copy.html
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
copy_arr = arr.copy()
copy_arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Let's modify the content of copy_arr and check whether it modified the original array as well
copy_arr[3] = 45
copy_arr
array([ 0, 1, 2, 45, 4, 5, 6, 7, 8, 9])
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Notice that
• The contents of the original array were not modified when we changed our copy array.
What are object arrays ?
Object arrays are arrays whose elements can be any Python object.
Documentation: https://numpy.org/devdocs/reference/arrays.scalars.html#numpy.object_
arr = np.array([1, 'm', [1,2,3]], dtype = 'object')
arr
array([1, 'm', list([1, 2, 3])], dtype=object)
But arrays are supposed to hold homogeneous data. How is it storing data of various
types?
Remember that everything is an object in Python.
Just like a Python list,
• The data actually stored in object arrays are references to Python objects, not the
objects themselves.
Hence, their elements need not be of the same Python type.
Every element in the array is an object. Hence, the dtype = object.
Let's make a copy of object array and check whether it returns a shallow copy or deep copy.
copy_arr = arr.copy()
copy_arr
array([1, 'm', list([1, 2, 3])], dtype=object)
Now, let's try to modify the list elements in copy_arr
copy_arr[2][0] = 999
copy_arr
array([1, 'm', list([999, 2, 3])], dtype=object)
Let's see if it changed the original array as well
arr
array([1, 'm', list([999, 2, 3])], dtype=object)
It did change the original array.
Hence, for an object array, .copy() copies only the references stored in the array, so it behaves like a shallow copy.
Any change inside a 2nd level element (like the list here) will be reflected in the original array as well.
So, how do we create a deep copy then ?
We can do so using the copy.deepcopy() function
copy.deepcopy()
Returns the deep copy of array
Documentation: https://docs.python.org/3/library/copy.html#copy.deepcopy
import copy
arr = np.array([1, 'm', [1,2,3]], dtype = 'object')
arr
array([1, 'm', list([1, 2, 3])], dtype=object)
Let's make a copy using deepcopy()
deep_copy_arr = copy.deepcopy(arr)
deep_copy_arr
array([1, 'm', list([1, 2, 3])], dtype=object)
Let's modify the list inside the copied array
deep_copy_arr[2][0] = 999
deep_copy_arr
array([1, 'm', list([999, 2, 3])], dtype=object)
arr
array([1, 'm', list([1, 2, 3])], dtype=object)
Notice that,
• the changes in the copied array didn't reflect back to the original array.
copy.deepcopy() returns a deep copy of an array.
Summarizing
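• .view() returns a shallow copy (view) of an array
• .copy() returns a deep copy of an array, except for object type arrays, where only the references are copied
• copy.deepcopy() returns a deep copy of an array, including the objects inside an object array.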
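The three options can be compared side by side on an object array. This is a sketch under the assumption that we care about two questions: does the new array share the array's own memory, and is the inner list the very same Python object? The dictionary names are made up for the example.

```python
import copy
import numpy as np

arr = np.array([1, 'm', [1, 2, 3]], dtype=object)

candidates = {
    "view": arr.view(),
    "copy": arr.copy(),
    "deepcopy": copy.deepcopy(arr),
}

# For each candidate: (shares the array memory?, inner list is the same object?)
report = {
    name: (bool(np.shares_memory(arr, new_arr)), new_arr[2] is arr[2])
    for name, new_arr in candidates.items()
}

for name, (shares, same_list) in report.items():
    print(name, "| shares memory:", shares, "| same inner list:", same_list)
```

Only copy.deepcopy() gives an array that is independent at both levels.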