Machine Learning Part 02
NumPy is an essential tool for data scientists, engineers, and researchers working on
numerical and scientific computing tasks in Python due to its efficiency, flexibility, and
extensive mathematical capabilities.
1. Arrays: NumPy's primary data structure is the ndarray (n-dimensional array). These
arrays are similar to Python lists but are more efficient for numerical operations because
they allow you to perform element-wise operations and take advantage of low-level
optimizations.
2. Integration with Other Libraries: NumPy is often used in conjunction with libraries like
SciPy for scientific computing, Matplotlib for data visualization, and pandas for data
manipulation and analysis.
The elements in a NumPy array are all required to be of the same data type, and thus
will be the same size in memory.
NumPy arrays facilitate advanced mathematical and other types of operations on large
numbers of data. Typically, such operations are executed more efficiently and with less
code than is possible using Python’s built-in sequences.
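As a quick illustration of that efficiency claim, element-wise arithmetic that needs an explicit loop with a Python list is a single vectorized expression with an ndarray (a minimal sketch; the `prices` data is hypothetical):

```python
import numpy as np

prices = [10.0, 20.0, 30.0]  # hypothetical example data

# With a plain list, element-wise math needs an explicit loop or comprehension
doubled_list = [p * 2 for p in prices]

# With an ndarray, the same operation is one vectorized expression,
# executed in optimized compiled code rather than the Python interpreter
arr = np.array(prices)
doubled_arr = arr * 2

print(doubled_list)   # [20.0, 40.0, 60.0]
print(doubled_arr)    # [20. 40. 60.]
```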
1. np.array:
np.array is used to create NumPy arrays, which can be one-dimensional, two-
dimensional, or multi-dimensional.
You need to import NumPy with import numpy as np before using it.
You can create arrays from Python lists or nested lists.
In [ ]: # np.array
a = np.array([1,2,3])
print(a)
[1 2 3]
2. 2D and 3D Arrays:
NumPy allows you to create multi-dimensional arrays. You demonstrated 2D and
3D arrays using nested lists.
In [ ]: # 2D
n = np.array([[1,2,3],[4,5,6]])
print(n)
[[1 2 3]
[4 5 6]]
In [ ]: # 3d
c = np.array([[[1,2],[3,4]],[[5,6],[7,8]]])
print(c)
[[[1 2]
[3 4]]
[[5 6]
[7 8]]]
3. dtype:
You can specify the data type of the elements in a NumPy array using the dtype
parameter.
In [ ]: # dtype
np.array([1,2,3],dtype=float)
Out[ ]: array([1., 2., 3.])
In [ ]: np.array([1,2,3],dtype=int)
Out[ ]: array([1, 2, 3])
4. np.arange:
np.arange is used to create a range of values within a specified interval.
It generates an array of evenly spaced values, similar to Python's range function.
In [ ]: # np.arange
np.arange(1,11,2)
Out[ ]: array([1, 3, 5, 7, 9])
5. Reshaping Arrays:
You can reshape an array using the reshape method. This changes the
dimensions of the array while maintaining the total number of elements.
In your example, you used reshape to create a 4D array from a 1D array.
In [ ]: # with reshape
np.arange(16).reshape(2,2,2,2)
Out[ ]: array([[[[ 0,  1],
         [ 2,  3]],
        [[ 4,  5],
         [ 6,  7]]],
       [[[ 8,  9],
         [10, 11]],
        [[12, 13],
         [14, 15]]]])
6. np.ones:
np.ones creates an array filled with ones.
You specify the shape of the array as a tuple. For example, np.ones((3,4))
creates a 3x4 array filled with ones.
In [ ]: # np.ones
np.ones((3,4))
7. np.zeros:
np.zeros creates an array filled with zeros.
Similar to np.ones , you specify the shape of the array as a tuple.
In [ ]: #np.zeros
np.zeros((3,4))
8. np.random:
np.random.random generates random numbers between 0 and 1 in a specified
shape.
It's useful for generating random data for simulations or experiments.
In [ ]: # np.random
np.random.random((3,4))
9. np.linspace:
np.linspace generates evenly spaced values over a specified range.
You specify the start, end, and the number of values you want. In your example,
you used dtype=int to ensure integer values.
In [ ]: # np.linspace
np.linspace(-10,10,10,dtype=int)
Out[ ]: array([-10,  -8,  -6,  -4,  -2,   1,   3,   5,   7,  10])
10. np.identity:
np.identity creates an identity matrix, which is a square matrix with ones on
the diagonal and zeros elsewhere.
You specify the size of the identity matrix as a single integer.
In [ ]: # np.identity
np.identity(3)
These NumPy functions are essential tools for working with numerical data in Python. They
provide the foundation for many scientific and data analysis tasks, making it easier to
perform calculations and manipulate data efficiently.
Array Attributes
In [ ]: import numpy as np
a1 = np.arange(10)
a2 = np.arange(12,dtype=float).reshape(3,4)
a3 = np.arange(8).reshape(2,2,2)
print(a1)
print('----------')
print(a2)
print('----------')
print(a3)
[0 1 2 3 4 5 6 7 8 9]
----------
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]]
----------
[[[0 1]
[2 3]]
[[4 5]
[6 7]]]
ndim
This attribute returns the number of dimensions of a NumPy array. For example:
In [ ]: a1.ndim
Out[ ]: 1
In [ ]: a2.ndim
Out[ ]: 2
In [ ]: a3.ndim
Out[ ]: 3
shape
The shape attribute returns a tuple representing the dimensions of the array. For example:
In [ ]: print(a1.shape)
print(a2.shape)
print(a3.shape)
(10,)
(3, 4)
(2, 2, 2)
size
The size attribute returns the total number of elements in the array. For example:
In [ ]: print(a1.size)
print(a2.size)
print(a3.size)
10
12
8
In [ ]: a3
Out[ ]: array([[[0, 1],
        [2, 3]],
       [[4, 5],
        [6, 7]]])
itemsize
The itemsize attribute returns the size (in bytes) of each element in the array. For example:
In [ ]: a2.itemsize
Out[ ]: 8
In [ ]: a3.itemsize
Out[ ]: 4
a2.itemsize returns 8 because a2 is of type float64, which has 8 bytes per element.
a3.itemsize returns 4 because a3 is of type int32, which has 4 bytes per element.
dtype
The dtype attribute returns the data type of the elements in the array. For example:
In [ ]: print(a1.dtype)
print(a2.dtype)
print(a3.dtype)
int32
float64
int32
Changing Datatype
You can change the data type of an array using the astype method.
In [ ]: # astype
a3.astype(np.int32)
Out[ ]: array([[[0, 1],
        [2, 3]],
       [[4, 5],
        [6, 7]]])
Array operations
In [ ]: a1 = np.arange(12).reshape(3,4)
a2 = np.arange(12,24).reshape(3,4)
In [ ]: print(a1)
print('--------')
print(a2)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
--------
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]
Scalar operations
These operations involve performing mathematical operations between the entire array and a scalar value. The outputs below correspond to a1 + 2, a1 ** 2, and a1 - 5:
[[ 2 3 4 5]
[ 6 7 8 9]
[10 11 12 13]]
-----
[[ 0 1 4 9]
[ 16 25 36 49]
[ 64 81 100 121]]
-----
[[-5 -4 -3 -2]
[-1 0 1 2]
[ 3 4 5 6]]
In [ ]: a2 > 5
Vector operations
These operations involve performing mathematical operations between two arrays.
In [ ]: # arithmetic
a1 ** a2
NumPy provides powerful tools for working with arrays efficiently, making it a valuable
library for scientific and numerical computations in Python.
In [ ]: # NOTE: the cell defining a1 is missing; it is re-created here as a random
# 3x3 array of values between 0 and 100 (an assumption), so the outputs
# below reflect one particular run
a1 = np.round(np.random.random((3,3))*100)
a2 = np.arange(12).reshape(3,4)
a3 = np.arange(12,24).reshape(4,3)
print(a1)
print('------------')
print(a2)
print('-------------')
print(a3)
In [ ]: np.max(a1)
Out[ ]: 98.0
In [ ]: np.min(a1)
Out[ ]: 11.0
In [ ]: np.sum(a1)
Out[ ]: 528.0
In [ ]: np.prod(a1)
Out[ ]: 1661426069717568.0
In [ ]: np.max(a1,axis=0)
In [ ]: np.max(a1,axis=1)
In [ ]: np.min(a1,axis=0)
In [ ]: np.mean(a1)
Out[ ]: 58.666666666666664
In [ ]: np.median(a2)
Out[ ]: 5.5
In [ ]: np.std(a1)
Out[ ]: 28.85211334450987
In [ ]: np.var(a1)
Out[ ]: 832.4444444444446
In [ ]: np.median(a1,axis=0)
4. Trigonometric Functions:
You can apply trigonometric functions to an array, such as sine ( np.sin() ) and cosine
( np.cos() ), which perform element-wise calculations on the input array.
In [ ]: np.sin(a1)
In [ ]: np.cos(a1)
5. Dot Product:
The np.dot() function computes the dot product of two arrays, which is particularly
useful for matrix multiplication. In your example, you multiplied a2 and a3 using
np.dot() to obtain the result.
In [ ]: print(a2)
print('------------')
print(a3)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
------------
[[12 13 14]
[15 16 17]
[18 19 20]
[21 22 23]]
In [ ]: np.dot(a2,a3)
In [ ]: np.log(a1)
In [ ]: np.exp(a1)
In [ ]: a=np.random.random((2,3))*100
print(a)
In [ ]: np.round(a)
In [ ]: np.floor(a)
These NumPy functions and operations are essential tools for numerical and scientific
computing in Python, providing a wide range of capabilities for data manipulation, analysis,
and mathematical calculations.
NumPy Fundamentals (Part 2)
In [ ]: import numpy as np
In [ ]: a1 = np.arange(10)
a2 = np.arange(12).reshape(3,4)
a3 = np.arange(8).reshape(2,2,2)
print(a1)
print('-------------')
print(a2)
print('-------------')
print(a3)
[0 1 2 3 4 5 6 7 8 9]
-------------
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
-------------
[[[0 1]
[2 3]]
[[4 5]
[6 7]]]
In [ ]: # For 1D array
a1
Out[ ]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Indexing and slicing a 1D array work like Python lists: a single index returns an element (for example, a1[5] returns 5 and a1[6] returns 6), while a slice returns a subarray (for example, a1[3:9] returns [3 4 5 6 7 8]).
In [ ]: # for 2D array
a2
Out[ ]: array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [ ]: print(a2[1,0])
print(a2[2,3])
4
11
10/13/23, 10:14 AM Day24 - NumPy Fundamentals(part-2)
In [ ]: a2[:2,1::2]   # (reconstructed input cell)
Out[ ]: array([[1, 3],
       [5, 7]])
In [ ]: a2[::2,1::2]
Out[ ]: array([[ 1,  3],
       [ 9, 11]])
In [ ]: a2[1,::3]
Out[ ]: array([4, 7])
In [ ]: # For 3D Array
a3 = np.arange(27).reshape(3,3,3)
a3
Out[ ]: array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],
       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],
       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])
In [ ]: # how to extract 18
a3[2,0,0]
Out[ ]: 18
In [ ]: a3[2,0,1:]   # (reconstructed input cell)
Out[ ]: array([19, 20])
In [ ]: a3[::2,0,::2]
Out[ ]: array([[ 0,  2],
       [18, 20]])
In [ ]: a3[2,1:,1:]
Out[ ]: array([[22, 23],
       [25, 26]])
In [ ]: a3[0,1,:]
Out[ ]: array([3, 4, 5])
Part 3 - Iterating:
Iterating through NumPy arrays using for loops.
Shows how to loop through elements in 1D and 2D arrays.
The np.nditer function is introduced for iterating through all elements in the array.
In [ ]: a1
Out[ ]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [ ]: for i in a1:
    print(i)
0
1
2
3
4
5
6
7
8
9
In [ ]: a2
Out[ ]: array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [ ]: for i in a2:
    print(i)
[0 1 2 3]
[4 5 6 7]
[ 8 9 10 11]
In [ ]: for i in np.nditer(a2):
    print(i)
0
1
2
3
4
5
6
7
8
9
10
11
In [ ]: a3
Out[ ]: array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],
       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],
       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])
In [ ]: for i in a3:
    print(i)
[[0 1 2]
[3 4 5]
[6 7 8]]
[[ 9 10 11]
[12 13 14]
[15 16 17]]
[[18 19 20]
[21 22 23]
[24 25 26]]
In [ ]: # nditer
for i in np.nditer(a3):
    print(i)
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
Part 4 - Reshaping:
In [ ]: a2
Out[ ]: array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Reshape
In [ ]: np.reshape(a2,(4,3))
Out[ ]: array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
Transpose
In [ ]: a2
Out[ ]: array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [ ]: np.transpose(a2)
Out[ ]: array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])
In [ ]: a2.T
Out[ ]: array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])
Ravel
In [ ]: a3.ravel()
Stacking
In [ ]: # horizontal stacking
a4 = np.arange(12).reshape(3,4)
a5 = np.arange(12,24).reshape(3,4)
a5
In [ ]: np.hstack((a4,a5))
In [ ]: # Vertical stacking
np.vstack((a4,a5))
Out[ ]: array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])
10/14/23, 10:29 AM Day25 - Broadcasting-NumPy
The term broadcasting describes how NumPy treats arrays with different shapes during
arithmetic operations.
The smaller array is “broadcast” across the larger array so that they have compatible
shapes.
In [ ]: import numpy as np
In [ ]: # same shape
a = np.arange(6).reshape(2,3)
b = np.arange(6,12).reshape(2,3)
print(a)
print('------------')
print(b)
print('------------')
print(a+b)
[[0 1 2]
[3 4 5]]
------------
[[ 6 7 8]
[ 9 10 11]]
------------
[[ 6 8 10]
[12 14 16]]
In [ ]: # diff shape
a = np.arange(6).reshape(2,3)
b = np.arange(3).reshape(1,3)
print(a)
print('------------')
print(b)
print('------------')
print(a+b)
Broadcasting Rules
1. Make the two arrays have the same number of dimensions.: If two arrays have
different numbers of dimensions, NumPy will pad the smaller shape with ones on the left
side, making the shapes compatible for element-wise operations.
2. Make each dimension of the two arrays the same size.: If the shapes of the two arrays
do not match in any dimension, NumPy will try to stretch the smaller dimension to match
the larger one, provided that the smaller dimension's size is 1. If stretching is not possible, a
"ValueError" will be raised.
3. Raise an error otherwise: If the sizes of the dimensions differ and neither is 1, NumPy will
raise a "ValueError."
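These rules can be checked without allocating any arrays using np.broadcast_shapes (available since NumPy 1.20), which applies exactly the padding and stretching logic described above:

```python
import numpy as np

# Rule 1: the smaller shape is padded with ones on the left: (3,) -> (1, 3)
# Rule 2: size-1 dimensions are stretched to match: (1, 3) vs (4, 3) -> (4, 3)
print(np.broadcast_shapes((4, 3), (3,)))    # (4, 3)

# Both inputs stretch their size-1 dimension
print(np.broadcast_shapes((1, 3), (3, 1)))  # (3, 3)

# Rule 3: sizes that differ and are not 1 raise a ValueError
try:
    np.broadcast_shapes((3, 4), (3,))
except ValueError as e:
    print('ValueError:', e)
```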
Examples
Broadcasting Example 1: The shapes of a and b are (4,3) and (3,), respectively, and
broadcasting is successful.
In [ ]: a = np.arange(12).reshape(4,3)
b = np.arange(3)
print(a)
print('------------')
print(b)
print('------------')
print(a+b)
Example 2: Broadcasting does not work when the shapes of two arrays cannot be made
compatible.
In [ ]: a = np.arange(12).reshape(3,4)
b = np.arange(3)
print(a)
print('------------')
print(b)
print('------------')
print(a+b)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
------------
[0 1 2]
------------
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[ ], line 8
      6 print(b)
      7 print('------------')
----> 8 print(a+b)
ValueError: operands could not be broadcast together with shapes (3,4) (3,)
The shapes of a and b are (3,4) and (3,), respectively, which are not compatible for
broadcasting, resulting in a "ValueError."
Broadcasting Example 3: The shapes of a and b are (1,3) and (3,1), respectively, and
broadcasting is successful.
In [ ]: a = np.arange(3).reshape(1,3)
b = np.arange(3).reshape(3,1)
print(a)
print('------------')
print(b)
print('------------')
print(a+b)
In [ ]: a = np.arange(3).reshape(1,3)
b = np.arange(4).reshape(4,1)
print(a)
print('------------')
print(b)
print('------------')
print(a + b)
[[0 1 2]]
------------
[[0]
[1]
[2]
[3]]
------------
[[0 1 2]
[1 2 3]
[2 3 4]
[3 4 5]]
Broadcasting Example 4: The shape of 'a' is (1,), and the shape of 'b' is (2,2);
broadcasting is successful.
In [ ]: a = np.array([1])
# shape -> (1,)
b = np.arange(4).reshape(2,2)
# shape -> (2,2)
print(a)
print('------------')
print(b)
print('------------')
print(a+b)
[1]
------------
[[0 1]
[2 3]]
------------
[[1 2]
[3 4]]
Broadcasting Example 5: The shapes of 'a' and 'b' are (3,4) and (4,3), which are not
compatible for broadcasting, resulting in a "ValueError."
In [ ]: a = np.arange(12).reshape(3,4)
b = np.arange(12).reshape(4,3)
print(a)
print('------------')
print(b)
print('------------')
print(a+b)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
------------
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]
------------
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[ ], line 8
      6 print(b)
      7 print('------------')
----> 8 print(a+b)
ValueError: operands could not be broadcast together with shapes (3,4) (4,3)
NumPy Coding Demonstrations
Working with mathematical formulas (with the help of NumPy arrays)
In [ ]: import numpy as np
The sigmoid function is a common activation function used in machine learning and
neural networks.
In [ ]: def sigmoid(array):
    return 1/(1 + np.exp(-array))
a = np.arange(100)
sigmoid(a)
MSE is a common loss function used in regression problems to measure the average
squared difference between actual and predicted values.
In [ ]: def mse(actual,predicted):
    return np.mean((actual - predicted)**2)
actual = np.random.randint(1,50,25)
predicted = np.random.randint(1,50,25)
mse(actual,predicted)
Out[ ]: 384.12
The code demonstrates how to create an array containing missing values and check for
the presence of 'nan' values using the 'np.isnan' function.
To remove missing values, you can use boolean indexing with the '~' operator to
select and keep only the non-missing values.
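The cells behind that description are not shown above; a minimal sketch of the same workflow:

```python
import numpy as np

# An array containing missing values (np.nan)
a = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

# np.isnan returns a boolean mask marking the missing entries
mask = np.isnan(a)
print(mask)        # [False False  True False  True False]

# Boolean indexing with ~ (logical NOT) keeps only the non-missing values
clean = a[~mask]
print(clean)       # [1. 2. 4. 6.]
```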
This code illustrates how to create and plot different types of 2D graphs using NumPy
arrays and Matplotlib.
10/15/23, 10:58 AM Day26 -NumPy_Coding_Demonstrations.ipynb
It provides examples of plotting a straight line (y = x), a quadratic curve (y = x^2), a sine
wave (y = sin(x)), and a more complex function involving the sigmoid activation
function.
Matplotlib is used for creating and displaying the plots.
Each 'x' and 'y' array is generated using NumPy functions, and 'plt.plot' is used to create
the plots.
In [ ]: # plotting a 2D plot
import matplotlib.pyplot as plt
# y = x
x = np.linspace(-10,10,100)
y = x
plt.plot(x,y)
Out[ ]: [<matplotlib.lines.Line2D at 0x27a9450bfd0>]
In [ ]: # y = x^2
x = np.linspace(-10,10,100)
y = x**2
plt.plot(x,y)
Out[ ]: [<matplotlib.lines.Line2D at 0x27a9422f100>]
In [ ]: # y = sin(x)
x = np.linspace(-10,10,100)
y = np.sin(x)
plt.plot(x,y)
Out[ ]: [<matplotlib.lines.Line2D at 0x27a946160d0>]
In [ ]: # y = xlog(x)
x = np.linspace(-10,10,100)
y = x * np.log(x)
plt.plot(x,y)
C:\Users\disha\AppData\Local\Temp\ipykernel_14080\2564014901.py:3: RuntimeWarning:
invalid value encountered in log
y = x * np.log(x)
Out[ ]: [<matplotlib.lines.Line2D at 0x27a9469cd60>]
In [ ]: # sigmoid
x = np.linspace(-10,10,100)
y = 1/(1+np.exp(-x))
plt.plot(x,y)
Out[ ]: [<matplotlib.lines.Line2D at 0x27a9584eb20>]
Pandas is an open-source data manipulation and analysis library for the Python
programming language. It provides easy-to-use data structures and data analysis tools for
working with structured data, such as tabular data (like spreadsheets or SQL tables). The
name "pandas" is derived from the term "panel data," which is a type of multi-dimensional
data set commonly used in statistics and econometrics.
Pandas is particularly well-suited for tasks such as data cleaning, data transformation, and
data analysis. It offers two primary data structures: the Series (one-dimensional) and the
DataFrame (two-dimensional, tabular).
Pandas provides a wide range of functions and methods for data manipulation and analysis,
including:
Data cleaning: handling missing data, data imputation, and data alignment.
Data filtering and selection.
Aggregation and summarization of data.
Data merging and joining.
Time series data manipulation.
Reading and writing data from/to various file formats, such as CSV, Excel, SQL
databases, and more.
Pandas is an essential tool for data scientists, analysts, and anyone working with data in
Python. It is often used in conjunction with other libraries, such as NumPy for numerical
computations and Matplotlib or Seaborn for data visualization.
1. DataFrame:
A DataFrame is a two-dimensional, tabular data structure with labeled rows and
columns, similar to a spreadsheet or a SQL table.
In [ ]: import pandas as pd
data = {'name': ['Asha', 'Ravi'], 'age': [25, 30]}  # hypothetical example data; the original was not shown
df = pd.DataFrame(data)
1. Series:
A Series is a one-dimensional data structure that can be thought of as a single
column or row from a DataFrame.
Each element in a Series is associated with a label, called an index.
Series can hold various data types, including numbers, text, and dates.
In [ ]: import pandas as pd
Pandas provides a wide range of operations and functionality for working with data,
including:
1. Data Cleaning:
Handling missing data: Pandas provides methods like isna() , fillna() , and
dropna() to deal with missing values.
Data imputation: You can fill missing values with meaningful data using methods
like fillna() or statistical techniques.
2. Data Selection and Filtering:
You can select specific rows and columns, filter data based on conditions, and use
boolean indexing to retrieve relevant data.
3. Data Aggregation and Summarization:
You can group and summarize data, for example with groupby() and aggregation
functions such as sum() and mean() .
4. Data Merging and Joining:
You can merge data from multiple DataFrames using functions like merge() and
concat() .
This is particularly useful when working with multiple data sources.
5. Time Series Data Manipulation:
Pandas has built-in support for working with time series data, making it simple to
perform operations on time-based data.
6. Reading and Writing Data:
Pandas can read data from various file formats, including CSV, Excel, SQL
databases, JSON, and more, using functions like read_csv() , read_excel() ,
and read_sql() .
It can also write DataFrames back to these formats using functions like to_csv()
and to_excel() .
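As a small illustration of the read/write functions (the data here is hypothetical, and an in-memory io.StringIO buffer stands in for a real file):

```python
import io
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({'name': ['Asha', 'Ravi'], 'score': [88, 92]})

# to_csv with no path returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)
print(csv_text)

# read_csv accepts any file-like object, here an in-memory buffer
back = pd.read_csv(io.StringIO(csv_text))
print(back.equals(df))   # True
```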
In summary, Pandas is a powerful Python library for data manipulation and analysis that
simplifies working with structured data, making it a valuable tool for anyone dealing with
data in Python. Its flexibility and extensive functionality make it an essential part of the data
science toolkit.
A pandas Series is a one-dimensional data structure in the pandas library, which is a popular
Python library for data manipulation and analysis. It can be thought of as a labeled array or a
column in a spreadsheet or a single column in a SQL table. Each element in a Series is
associated with a label or index, allowing for easy and efficient data manipulation and
analysis.
1. Homogeneous Data: All elements in a Series must be of the same data type, such as
integers, floats, strings, or even more complex data structures like other Series or
dataframes.
2. Labels: Each element in a Series is associated with a label or an index. You can think of
the index as a unique identifier for each element in the Series. The labels can be
integers, strings, or other types.
3. Powerful Data Operations: Pandas Series allows you to perform various operations like
filtering, slicing, mathematical operations, and more on the data elements efficiently.
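A short sketch of those operations on a hypothetical Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])  # hypothetical data

# Filtering with a boolean condition
print(s[s > 15].tolist())    # [20, 30, 40]

# Vectorized arithmetic applies to every element
print((s * 2).tolist())      # [20, 40, 60, 80]

# Label-based slicing is inclusive of the end label
print(s['b':'d'].tolist())   # [20, 30, 40]
```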
You can create a pandas Series from various data sources, such as Python lists, NumPy
arrays, or dictionaries. Here's an example of creating a Series from a Python list:
In [ ]: import pandas as pd
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)
You can access data in a Series using its labels or positions, and you can perform various
operations like filtering, aggregating, and applying functions to the data. Series are often
used as building blocks for more complex data structures like pandas DataFrames, which are
essentially collections of Series organized in a tabular structure.
Importing Pandas:
Importing the pandas library is done using the import pandas as pd statement. It's
considered best practice to also import numpy when working with pandas since numpy is
often used internally for data manipulation.
In [ ]: import numpy as np # it is best practice with pandas to always import numpy as well
import pandas as pd
In [ ]: # string
country = ['India','Pakistan','USA']
pd.Series(country)
Out[ ]: 0       India
1    Pakistan
2         USA
dtype: object
In [ ]: # integers
runs = [67,38,35]
runs_series = pd.Series(runs)
print(runs_series)
0 67
1 38
2 35
dtype: int64
You can specify custom index labels for a Series when creating it from a list. This allows
you to associate each element with a specific label.
In [ ]: # custom Index
marks = [67,54,89,100]
subjects = ['Math','English','SocialScience','Marathi']
pd.Series(marks,index=subjects)
10/17/23, 10:01 AM Day28 - Pandas_series
Out[ ]: Math              67
English           54
SocialScience     89
Marathi          100
dtype: int64
You can set a name for the Series when creating it. This name can be used to label the
Series, making it more descriptive.
In [ ]: Marks = pd.Series(marks,index=subjects,name="Exam_Marks")
Marks
Out[ ]: Math              67
English           54
SocialScience     89
Marathi          100
Name: Exam_Marks, dtype: int64
In [ ]: marks = {
'maths':67,
'english':57,
'science':89,
'hindi':100
}
marks_series = pd.Series(marks,name='marks')
marks_series
Out[ ]: maths       67
english     57
science     89
hindi      100
Name: marks, dtype: int64
In [ ]: marks
In [ ]: # size
marks_series.size
Out[ ]: 4
The .dtype attribute of a Series returns the data type of the elements in the Series. In
this case, it's 'int64', indicating integers.
In [ ]: # dtype
marks_series.dtype
Out[ ]: dtype('int64')
In [ ]: # name
marks_series.name
Out[ ]: 'marks'
In [ ]: # is_unique
marks_series.is_unique
Out[ ]: True
In [ ]: pd.Series([1,1,2,3,4,5]).is_unique
Out[ ]: False
In [ ]: # index
marks_series.index
In [ ]: # values
marks_series.values
These features and attributes make pandas Series a versatile and powerful data structure for
working with one-dimensional data, making it a fundamental tool for data manipulation and
analysis in Python.
In [ ]: import pandas as pd
import numpy as np
In [ ]: import warnings
warnings.filterwarnings("ignore")
In [ ]: vk = pd.read_csv('/content/kohli_ipl.csv',index_col='match_no',squeeze=True)
vk
Out[ ]: match_no
1       1
2      23
3      13
4      12
5       1
       ..
211     0
212    20
213    73
214    25
215     7
Name: runs, Length: 215, dtype: int64
The code then displays the 'vk' Series, which represents data related to cricket matches
with the match number as the index and the number of runs scored by a player.
In [ ]: movies = pd.read_csv('/content/bollywood.csv',index_col='movie',squeeze=True)
movies
The 'movies' Series is displayed, containing data related to Bollywood movies with the
movie title as the index and the lead actor's name as the data.
Series methods
In [ ]: # head and tail
vk.head()
Out[ ]: match_no
1     1
2    23
3    13
4    12
5     1
Name: runs, dtype: int64
In [ ]: vk.head(10)
Out[ ]: match_no
1      1
2     23
3     13
4     12
5      1
6      9
7     34
8      0
9     21
10     3
Name: runs, dtype: int64
In [ ]: vk.tail()
Out[ ]: match_no
211     0
212    20
213    73
214    25
215     7
Name: runs, dtype: int64
In [ ]: # sample
movies.sample()   # (reconstructed input cell)
Out[ ]: movie
Dhund (2003 film)    Amar Upadhyaya
Name: lead, dtype: object
In [ ]: movies.sample(5)   # (reconstructed input cell)
Out[ ]: movie
Halla Bol                     Ajay Devgn
Shaadi No. 1                Fardeen Khan
Karma Aur Holi            Rati Agnihotri
Patiala House (film)        Rishi Kapoor
Chaalis Chauraasi       Naseeruddin Shah
Name: lead, dtype: object
In [ ]: # `value_counts()`: Counts the number of occurrences of each lead actor in the 'movies' Series
movies.value_counts()
Out[ ]: Akshay Kumar        48
Amitabh Bachchan    45
Ajay Devgn          38
Salman Khan         31
Sanjay Dutt         26
                    ..
Diganth              1
Parveen Kaur         1
Seema Azmi           1
Akanksha Puri        1
Edwin Fernandes      1
Name: lead, Length: 566, dtype: int64
In [ ]: # sort_values
vk.sort_values()
Out[ ]: match_no
87       0
211      0
207      0
206      0
91       0
      ...
164    100
120    100
123    108
126    109
128    113
Name: runs, Length: 215, dtype: int64
In [ ]: vk.sort_values(ascending=False).head(1).values[0]   # (reconstructed input cell)
Out[ ]: 113
In [ ]: # sort_index
movies.sort_index(ascending=False)
Out[ ]: movie
Zor Lagaa Ke...Haiya! Meghan Jadhav
Zokkomon Darsheel Safary
Zindagi Tere Naam Mithun Chakraborty
Zindagi Na Milegi Dobara Hrithik Roshan
Zindagi 50-50 Veena Malik
...
2 States (2014 film) Arjun Kapoor
1971 (2007 film) Manoj Bajpayee
1920: The Evil Returns Vicky Ahuja
1920: London Sharman Joshi
1920 (film) Rajniesh Duggall
Name: lead, Length: 1500, dtype: object
In-Place Sorting:
The code shows how to perform in-place sorting by using the inplace=True
argument with the sort_index() method for the 'movies' Series.
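The cell demonstrating this is not shown; a minimal sketch with a hypothetical Series:

```python
import pandas as pd

s = pd.Series(['x', 'y'], index=['b', 'a'])  # hypothetical data

# With inplace=True, sort_index modifies s itself and returns None
result = s.sort_index(inplace=True)
print(result)             # None
print(s.index.tolist())   # ['a', 'b']
```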
In [ ]: movies
Out[ ]: movie
Zor Lagaa Ke...Haiya! Meghan Jadhav
Zokkomon Darsheel Safary
Zindagi Tere Naam Mithun Chakraborty
Zindagi Na Milegi Dobara Hrithik Roshan
Zindagi 50-50 Veena Malik
...
2 States (2014 film) Arjun Kapoor
1971 (2007 film) Manoj Bajpayee
1920: The Evil Returns Vicky Ahuja
1920: London Sharman Joshi
1920 (film) Rajniesh Duggall
Name: lead, Length: 1500, dtype: object
In [ ]: vk.count()   # (reconstructed input cell)
Out[ ]: 215
In [ ]: # `sum()`: Calculates the total runs scored by the player in the 'vk' Series
vk.sum()
Out[ ]: 6634
In [ ]: print(vk.mean())
print('---------')
print(vk.median())
print('---------')
print(movies.mode())
print('---------')
print(vk.std())
print('---------')
print(vk.var())
30.855813953488372
---------
24.0
---------
0 Akshay Kumar
Name: lead, dtype: object
---------
26.22980132830278
---------
688.0024777222343
In [ ]: # min/max
vk.min()
Out[ ]: 0
In [ ]: vk.max()
Out[ ]: 113
In [ ]: # `describe()`: Provides summary statistics for the 'vk' Series, including count, mean, std, min, quartiles, and max
vk.describe()
Out[ ]: count    215.000000
mean      30.855814
std       26.229801
min        0.000000
25%        9.000000
50%       24.000000
75%       48.000000
max      113.000000
Name: runs, dtype: float64
In [ ]: import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
In [ ]: # import dataset
vk = pd.read_csv('/content/kohli_ipl.csv',index_col='match_no',squeeze=True)
vk
Out[ ]: match_no
1 1
2 23
3 13
4 12
5 1
..
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int64
In [ ]: movies = pd.read_csv('/content/bollywood.csv',index_col='movie',squeeze=True)
movies
Out[ ]: movie
Uri: The Surgical Strike Vicky Kaushal
Battalion 609 Vicky Ahuja
The Accidental Prime Minister (film) Anupam Kher
Why Cheat India Emraan Hashmi
Evening Shadows Mona Ambegaonkar
...
Hum Tumhare Hain Sanam Shah Rukh Khan
Aankhen (2002 film) Amitabh Bachchan
Saathiya (film) Vivek Oberoi
Company (film) Ajay Devgn
Awara Paagal Deewana Akshay Kumar
Name: lead, Length: 1500, dtype: object
astype:
The astype method is used to change the data type of the elements in a Pandas
Series. In your example, you used it to change the data type of the 'vk' Series from
'int64' to 'int16', which can reduce memory usage if you're dealing with large datasets.
In [ ]: # astype
import sys
sys.getsizeof(vk)
Out[ ]: 3456
In [ ]: vk.astype('int16')
Out[ ]: match_no
1 1
2 23
3 13
4 12
5 1
..
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int16
In [ ]: sys.getsizeof(vk.astype('int16'))
Out[ ]: 2166
between:
The between method is used to filter a Series to include only elements that fall within
a specified range. In your example, you used it to filter the 'vk' Series to include only
values between 51 and 99.
In [ ]: # between
vk.between(51,99)
Out[ ]: match_no
1 False
2 False
3 False
4 False
5 False
...
211 False
212 False
213 True
214 False
215 False
Name: runs, Length: 215, dtype: bool
In [ ]: # between
vk[vk.between(51,99)].size
Out[ ]: 43
clip:
The clip method is used to limit the values in a Series to a specified range. It replaces
values that are below the lower bound with the lower bound and values above the
upper bound with the upper bound. This can be useful for handling outliers or ensuring
data falls within a certain range.
In [ ]: # clip
vk
Out[ ]: match_no
1 1
2 23
3 13
4 12
5 1
..
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int64
In [ ]: vk.clip(50,80)
Out[ ]: match_no
1 50
2 50
3 50
4 50
5 50
..
211 50
212 50
213 73
214 50
215 50
Name: runs, Length: 215, dtype: int64
drop_duplicates:
The drop_duplicates method is used to remove duplicate values from a Series. It
returns a new Series with only the unique values. In your example, you used it with the
'temp' Series to remove duplicate values.
In [ ]: # drop_duplicates
temp = pd.Series([1,1,2,2,3,3,4,4])
temp
Out[ ]: 0    1
1 1
2 2
3 2
4 3
5 3
6 4
7 4
dtype: int64
In [ ]: temp.duplicated().sum()
Out[ ]: 4
In [ ]: temp.drop_duplicates()
Out[ ]: 0    1
2 2
4 3
6 4
dtype: int64
isnull:
The isnull method is used to check for missing or NaN (Not-a-Number) values in a
Series. It returns a Boolean Series where 'True' indicates missing values and 'False'
indicates non-missing values. In your example, you used it to find missing values in the
'temp' Series.
In [ ]: # isnull
temp = pd.Series([1,2,3,np.nan,5,6,np.nan,8,np.nan,10])
temp
0 1.0
Out[ ]:
1 2.0
2 3.0
3 NaN
4 5.0
5 6.0
6 NaN
7 8.0
8 NaN
9 10.0
dtype: float64
In [ ]: temp.isnull().sum()
3
Out[ ]:
dropna:
The dropna method is used to remove missing values from a Series. It returns a new
Series with the missing values removed. In your example, you used it to remove missing
values from the 'temp' Series.
In [ ]: # dropna
temp.dropna()
0 1.0
Out[ ]:
1 2.0
2 3.0
4 5.0
5 6.0
7 8.0
9 10.0
dtype: float64
fillna:
The fillna method is used to fill missing values in a Series with specified values. It
can be used to replace missing data with a specific value, such as the mean of the non-
missing values. In your example, you filled missing values in the 'temp' Series with the
mean of the non-missing values.
In [ ]: # fillna
temp.fillna(temp.mean())
0 1.0
Out[ ]:
1 2.0
2 3.0
3 5.0
4 5.0
5 6.0
6 5.0
7 8.0
8 5.0
9 10.0
dtype: float64
isin:
The isin method is used to filter a Series to include only elements that match a list of
values. In your example, you used it to filter the 'vk' Series to include only values that
match either 49 or 99.
In [ ]: # isin
vk[vk.isin([49,99])]
match_no
Out[ ]:
82 99
86 49
Name: runs, dtype: int64
apply:
The apply method is used to apply a function to each element of a Series. In your
example, you applied a lambda function to the 'movies' Series to extract the first word
of each element and convert it to uppercase.
In [ ]: # apply
movies
movie
Out[ ]:
Uri: The Surgical Strike Vicky Kaushal
Battalion 609 Vicky Ahuja
The Accidental Prime Minister (film) Anupam Kher
Why Cheat India Emraan Hashmi
Evening Shadows Mona Ambegaonkar
...
Hum Tumhare Hain Sanam Shah Rukh Khan
Aankhen (2002 film) Amitabh Bachchan
Saathiya (film) Vivek Oberoi
Company (film) Ajay Devgn
Awara Paagal Deewana Akshay Kumar
Name: lead, Length: 1500, dtype: object
In [ ]: movies.apply(lambda x:x.split()[0].upper())
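The same lambda on a tiny stand-in for the 'movies' Series (three names taken from the data above):

```python
import pandas as pd

leads = pd.Series(['Vicky Kaushal', 'Anupam Kher', 'Emraan Hashmi'])
# take the first word of each element and uppercase it
print(leads.apply(lambda x: x.split()[0].upper()).tolist())  # ['VICKY', 'ANUPAM', 'EMRAAN']
```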
copy:
The copy method is used to create a copy of a Series. This copy is separate from the
original Series, and any modifications to the copy won't affect the original Series. In
your example, you created a copy of the 'vk' Series using the 'copy' method and
modified the copy without affecting the original Series.
In [ ]: # copy
vk
match_no
Out[ ]:
1 1
2 23
3 13
4 12
5 1
..
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int64
In [ ]: new = vk.head().copy()
In [ ]: new
match_no
Out[ ]:
1 1
2 23
3 13
4 12
5 1
Name: runs, dtype: int64
In [ ]: new[1] = 100
In [ ]: new
match_no
Out[ ]:
1 100
2 23
3 13
4 12
5 1
Name: runs, dtype: int64
In [ ]: vk.head()
match_no
Out[ ]:
1 1
2 23
3 13
4 12
5 1
Name: runs, dtype: int64
plot:
The plot method is used to create visualizations of data in a Series. You can specify
the type of plot (e.g., 'line', 'bar', 'pie') and customize various plot attributes. In your
example, you used it to create a pie chart of the top 20 most common values in the
'movies' Series.
In [ ]: # plot
movies.value_counts().head(20).plot(kind='pie')
<Axes: ylabel='lead'>
Out[ ]:
Creating DataFrames:
DataFrames can be created in various ways. You demonstrated creating DataFrames
using lists and dictionaries. Lists represent rows, and dictionaries represent columns.
Using List
In [ ]: # using lists
student_data = [
[100,80,10],
[90,70,7],
[120,100,14],
[80,50,2]
]
pd.DataFrame(student_data,columns=['iq','marks','package'])
0 100 80 10
1 90 70 7
2 120 100 14
3 80 50 2
Using Dictionary
In [ ]: # using dictionary
student_dict = {
'name':['nitish','ankit','rupesh','rishabh','amit','ankita'],
'iq':[100,90,120,80,0,0],
'marks':[80,70,100,50,0,0],
'package':[10,7,14,2,0,0]
}
students = pd.DataFrame(student_dict)
students
0 nitish 100 80 10
1 ankit 90 70 7
2 rupesh 120 100 14
3 rishabh 80 50 2
4 amit 0 0 0
5 ankita 0 0 0
You can also create DataFrames by reading data from CSV files using the
pd.read_csv() function.
In [ ]: # using read_csv
ipl = pd.read_csv('ipl-matches.csv')
ipl
[Output: 950 rows × 20 columns — every IPL match from the 2022 final at Narendra Modi Stadium back to the opening 2007/08 season games]
In [ ]: movies = pd.read_csv('movies.csv')
In [ ]: movies
[Output: 1629 rows × 18 columns — movie titles with IMDb IDs, poster links, and Wikipedia links]
Attributes of DataFrames:
DataFrames have several attributes that provide information about their structure and
content:
In [ ]: ipl.shape
(950, 20)
Out[ ]:
In [ ]: movies.shape
(1629, 18)
Out[ ]:
In [ ]: ipl.dtypes
ID int64
Out[ ]:
City object
Date object
Season object
MatchNumber object
Team1 object
Team2 object
Venue object
TossWinner object
TossDecision object
SuperOver object
WinningTeam object
WonBy object
Margin float64
method object
Player_of_Match object
Team1Players object
Team2Players object
Umpire1 object
Umpire2 object
dtype: object
In [ ]: movies.dtypes
title_x object
Out[ ]:
imdb_id object
poster_path object
wiki_link object
title_y object
original_title object
is_adult int64
year_of_release int64
runtime object
genres object
imdb_rating float64
imdb_votes int64
story object
summary object
tagline object
actors object
wins_nominations object
release_date object
dtype: object
In [ ]: ipl.columns
In [ ]: students.columns
Viewing Data:
To view the data in a DataFrame, you can use methods like head() , tail() , and
sample() to see the first few rows, last few rows, or random sample rows,
respectively.
In [ ]: # head
movies.head()
[Output: the first five rows of the movies DataFrame]
In [ ]: # sample
ipl.sample(5)
[Output: five randomly sampled IPL matches]
You can obtain information about a DataFrame using the info() method, which
provides data types, non-null counts, and memory usage.
In [ ]: # info
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1629 entries, 0 to 1628
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title_x 1629 non-null object
1 imdb_id 1629 non-null object
2 poster_path 1526 non-null object
3 wiki_link 1629 non-null object
4 title_y 1629 non-null object
5 original_title 1629 non-null object
6 is_adult 1629 non-null int64
7 year_of_release 1629 non-null int64
8 runtime 1629 non-null object
9 genres 1629 non-null object
10 imdb_rating 1629 non-null float64
11 imdb_votes 1629 non-null int64
12 story 1609 non-null object
13 summary 1629 non-null object
14 tagline 557 non-null object
15 actors 1624 non-null object
16 wins_nominations 707 non-null object
17 release_date 1522 non-null object
dtypes: float64(1), int64(3), object(14)
memory usage: 229.2+ KB
For numerical columns, you can use the describe() method to get statistics like
count, mean, standard deviation, min, max, and quartiles.
In [ ]: ipl.describe()
[Output: count, mean, std, min, quartiles, and max for the numeric columns ID and Margin]
Missing Data:
The isnull() function helps check for missing data (NaN values) in a DataFrame.
The sum() function can be used to count the missing values in each column.
In [ ]: # isnull
movies.isnull().sum()
Duplicated Rows:
You can check for duplicate rows in a DataFrame using the duplicated() function. It
returns the number of duplicated rows.
In [ ]: # duplicated
movies.duplicated().sum()
0
Out[ ]:
In [ ]: students.duplicated().sum()
0
Out[ ]:
Column Renaming:
You can rename columns in a DataFrame using the rename() function. It can be
performed temporarily or with permanent changes if you set the inplace parameter
to True .
In [ ]: # rename
students
0 nitish 100 80 10
1 ankit 90 70 7
2 rupesh 120 100 14
3 rishabh 80 50 2
4 amit 0 0 0
5 ankita 0 0 0
In [ ]: students.rename(columns={'package':'package_lpa'})
0 nitish 100 80 10
1 ankit 90 70 7
2 rupesh 120 100 14
3 rishabh 80 50 2
4 amit 0 0 0
5 ankita 0 0 0
In [ ]: students
0 nitish 100 80 10
1 ankit 90 70 7
2 rupesh 120 100 14
3 rishabh 80 50 2
4 amit 0 0 0
5 ankita 0 0 0
These files provide a basic understanding of working with DataFrames in pandas, including
creating, reading, exploring, and modifying them. It's important to note that these are
fundamental operations, and pandas offers many more capabilities for data manipulation
and analysis.
Creating DataFrames:
The code demonstrates how to create Pandas DataFrames using different methods:
Lists: You can create a DataFrame from a list of lists, where each inner list represents a
row.
Dictionaries: You can create a DataFrame from a dictionary where keys become column
names.
Reading from CSV: DataFrames can be created by reading data from a CSV file.
In [ ]: # using lists
student_data = [
[100,80,10],
[90,70,7],
[120,100,14],
[80,50,2]
]
pd.DataFrame(student_data,columns=['iq','marks','package'])
0 100 80 10
1 90 70 7
2 120 100 14
3 80 50 2
In [ ]: # using dicts
student_dict = {
'name':['nitish','ankit','rupesh','rishabh','amit','ankita'],
'iq':[100,90,120,80,0,0],
'marks':[80,70,100,50,0,0],
'package':[10,7,14,2,0,0]
}
students = pd.DataFrame(student_dict)
students.set_index('name',inplace=True)
students
name
nitish 100 80 10
ankit 90 70 7
rupesh 120 100 14
rishabh 80 50 2
amit 0 0 0
ankita 0 0 0
In [ ]: # using read_csv
movies = pd.read_csv('movies.csv')
movies
Uri: The
0 Surgical tt8291224 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/
Strike
Battalion
1 tt9472208 NaN https://en.wikipedia.
609
The
Accidental
2 Prime tt6986710 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/w
Minister
(film)
Why
3 Cheat tt8108208 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org
India
Evening
4 tt6028796 NaN https://en.wikipedia.org/
Shadows
Tera Mera
1624 Saath tt0301250 https://upload.wikimedia.org/wikipedia/en/2/2b... https://en.wikipedia.org/w
Rahen
Yeh
1625 Zindagi tt0298607 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/w
Ka Safar
Sabse
1626 Bada tt0069204 NaN https://en.wikipedia.org/
Sukh
In [ ]: ipl = pd.read_csv('ipl-matches.csv')
ipl
[Output: 950 rows × 20 columns of the IPL matches DataFrame]
You can select specific columns from a DataFrame using square brackets. For instance,
ipl['Venue'] selects the 'Venue' column from the 'ipl' DataFrame, and passing a list of
names selects multiple columns at once.
In [ ]: ipl['Venue']
In [ ]: students['package']
name
Out[ ]:
nitish 10
ankit 7
rupesh 14
rishabh 2
amit 0
ankita 0
Name: package, dtype: int64
In [ ]: # multiple cols
movies[['year_of_release','actors','title_x']]
[Output: the year_of_release, actors, and title_x columns for all 1629 movies]
In [ ]: ipl[['Team1','Team2','WinningTeam']]
[Output: Team1, Team2, and WinningTeam for all 950 matches]
iloc - uses integer-based indexing, and you can select rows by their index positions.
loc - uses label-based indexing, and you can select rows by their index labels.
In [ ]: # single row
movies.iloc[5]
In [ ]: # multiple row
movies.iloc[:5]
[Output: the first five rows of movies]
In [ ]: # fancy indexing
movies.iloc[[0,4,5]]
[Output: rows 0, 4, and 5 of movies]
In [ ]: # loc
students
name
nitish 100 80 10
ankit 90 70 7
rupesh 120 100 14
rishabh 80 50 2
amit 0 0 0
ankita 0 0 0
In [ ]: students.loc['nitish']
iq 100
Out[ ]:
marks 80
package 10
Name: nitish, dtype: int64
In [ ]: students.loc['nitish':'rishabh']
name
nitish 100 80 10
ankit 90 70 7
rupesh 120 100 14
rishabh 80 50 2
In [ ]: students.loc[['nitish','ankita','rupesh']]
name
nitish 100 80 10
ankita 0 0 0
rupesh 120 100 14
In [ ]: students.iloc[[0,3,4]]
name
nitish 100 80 10
rishabh 80 50 2
amit 0 0 0
In [ ]: # iloc
movies.iloc[0:3,0:3]
movies.iloc[0:3, 0:3] selects the first three rows and first three columns of the
'movies' DataFrame.
In [ ]: # loc
movies.loc[0:2,'title_x':'poster_path']
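The key difference between the two indexers, sketched on a toy frame: iloc slices by position and excludes the endpoint, while loc slices by label and includes it.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60]}, index=['x', 'y', 'z'])
print(df.iloc[0:2].index.tolist())    # ['x', 'y'] -- positions 0 and 1, end excluded
print(df.loc['x':'y'].index.tolist()) # ['x', 'y'] -- labels 'x' through 'y', end INCLUDED
```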
These are fundamental operations when working with Pandas DataFrames. They are useful
for data manipulation and analysis, allowing you to extract specific information from your
data.
Filtering a DataFrame
In [ ]: import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
In [ ]: ipl = pd.read_csv('ipl-matches.csv')
In [ ]: ipl.head(3)
[Output: the first three IPL matches]
In [ ]: ipl[ipl['MatchNumber'] == 'Final'][['Season','WinningTeam']]
In [ ]: ipl[ipl['SuperOver'] == 'Y'].shape[0]
14
Out[ ]:
5
Out[ ]:
51.473684210526315
Out[ ]:
Movies Dataset
In [ ]: movies = pd.read_csv('movies.csv')
In [ ]: movies.head(3)
[Output: the first three rows of movies]
In [ ]: mask1 = movies['genres'].str.contains('Action')
mask2 = movies['imdb_rating'] > 7.5
data = movies[mask1 & mask2]
data['title_x']
In [ ]: import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
1. value_counts
It works on a Series as well as a DataFrame in pandas.
In [ ]: marks = pd.DataFrame([
[100,80,10],
[90,70,7],
[120,100,14],
[80,70,14],
[80,70,14]
],columns=['iq','marks','package'])
marks
0 100 80 10
1 90 70 7
2 120 100 14
3 80 70 14
4 80 70 14
In [ ]: marks.value_counts()
iq marks package
Out[ ]:
80 70 14 2
90 70 7 1
100 80 10 1
120 100 14 1
Name: count, dtype: int64
In [ ]: ipl = pd.read_csv('ipl-matches.csv')
In [ ]: ipl.head(2)
[Output: the first two IPL matches]
In [ ]: # find which player has won most player of the match -> in finals and qualifiers
ipl[~ipl['MatchNumber'].str.isdigit()]['Player_of_Match'].value_counts()
In [ ]: ipl['TossDecision'].value_counts()
TossDecision
Out[ ]:
field 599
bat 351
Name: count, dtype: int64
In [ ]: ipl['TossDecision'].value_counts().plot(kind='pie')
<Axes: ylabel='count'>
Out[ ]:
2. sort_values
In [ ]: students = pd.DataFrame(
{
'name':['nitish','ankit','rupesh',np.nan,'mrityunjay',np.nan,'rishabh',np.nan,'aditya',np.nan],
'college':['bit','iit','vit',np.nan,np.nan,'vlsi','ssit',np.nan,np.nan,'git'],
'branch':['eee','it','cse',np.nan,'me','ce','civ','cse','bio',np.nan],
'cgpa':[6.66,8.25,6.41,np.nan,5.6,9.0,7.4,10,7.4,np.nan],
'package':[4,5,6,np.nan,6,7,8,9,np.nan,np.nan]
}
)
students
In [ ]: students.sort_values('name')
In [ ]: students.sort_values('name',na_position='first',ascending=False)
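A tiny self-contained illustration of the default NaN placement and na_position (toy data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['b', np.nan, 'a']})
# by default NaN rows are pushed to the end of the sorted result
print(df.sort_values('name')['name'].tolist()[:2])  # ['a', 'b']
# na_position='first' puts the NaN row (original index 1) at the top
print(df.sort_values('name', na_position='first').index.tolist()[0])  # 1
```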
In [ ]: marks = {'maths':67,'english':57,'science':89,'hindi':100}
marks_series = pd.Series(marks)
marks_series
maths 67
Out[ ]:
english 57
science 89
hindi 100
dtype: int64
In [ ]: marks_series.sort_index(ascending=False)
In [ ]: movies.head(1)
[Output: the first row of the movies DataFrame]
In [ ]: movies.set_index('title_x',inplace=True)
In [ ]: movies.rename(columns={'imdb_id':'imdb','poster_path':'link'},inplace=True)
In [ ]: movies.head(1)
title_x
[Output: the first row, now indexed by title_x, with imdb_id renamed to imdb and poster_path to link]
5. nunique
In [ ]: ipl['Season'].nunique()
15
Out[ ]:
6. isnull(series + dataframe)
In [ ]: students['name'][students['name'].isnull()]
In [ ]: # notnull(series + dataframe)
students['name'][students['name'].notnull()]
0 nitish
Out[ ]:
1 ankit
2 rupesh
4 mrityunjay
6 rishabh
8 aditya
Name: name, dtype: object
In [ ]: # hasnans(series)
students['name'].hasnans
True
Out[ ]:
In [ ]: students
7. dropna
In [ ]: students['name'].dropna()
0 nitish
Out[ ]:
1 ankit
2 rupesh
4 mrityunjay
6 rishabh
8 aditya
Name: name, dtype: object
In [ ]: students
In [ ]: students.dropna(how='any')
In [ ]: students.dropna(how='all')
In [ ]: students.dropna(subset=['name'])
In [ ]: students.dropna(subset=['name','college'])
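The how and subset variants above, condensed into one runnable sketch on toy data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['a', np.nan, 'c'], 'college': [np.nan, np.nan, 'x']})
print(len(df.dropna(how='any')))        # 1 -- only the fully complete row survives
print(len(df.dropna(how='all')))        # 2 -- only the all-NaN row is dropped
print(len(df.dropna(subset=['name'])))  # 2 -- rows where 'name' is present
```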
8. fillna(series + dataframe)
In [ ]: students['name'].fillna('unknown')
0 nitish
Out[ ]:
1 ankit
2 rupesh
3 unknown
4 mrityunjay
5 unknown
6 rishabh
7 unknown
8 aditya
9 unknown
Name: name, dtype: object
In [ ]: students['package'].fillna(students['package'].mean())
0 4.000000
Out[ ]:
1 5.000000
2 6.000000
3 6.428571
4 6.000000
5 7.000000
6 8.000000
7 9.000000
8 6.428571
9 6.428571
Name: package, dtype: float64
In [ ]: students['name'].fillna(method='bfill')
9. drop_duplicates
In [ ]: df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David'],
    'Age': [25, 30, 22, 25, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'New York', 'Chicago']
})
# Use the drop_duplicates method to remove duplicate rows based on all columns
df_no_duplicates = df.drop_duplicates()
Original DataFrame:
Name Age City
0 Alice 25 New York
1 Bob 30 San Francisco
2 Charlie 22 Los Angeles
3 Alice 25 New York
4 David 28 Chicago
10. drop(series + dataframe)
In [ ]: students
In [ ]: students.drop(columns=['branch','cgpa'],inplace=True)
In [ ]: students
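The same column-dropping pattern on a toy frame, showing that inplace=True mutates the original:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
df.drop(columns=['b', 'c'], inplace=True)  # modifies df in place, like the students example
print(df.columns.tolist())  # ['a']
```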
11. apply
In [ ]: points_df = pd.DataFrame(
{
'1st point':[(3,4),(-6,5),(0,0),(-10,1),(4,5)],
'2nd point':[(-3,4),(0,0),(2,2),(10,10),(1,1)]
}
)
points_df
0 (3, 4) (-3, 4)
1 (-6, 5) (0, 0)
2 (0, 0) (2, 2)
3 (-10, 1) (10, 10)
4 (4, 5) (1, 1)
In [ ]: def euclidean(row):
    pt_A = row['1st point']
    pt_B = row['2nd point']
    return ((pt_A[0] - pt_B[0])**2 + (pt_A[1] - pt_B[1])**2)**0.5
In [ ]: points_df['distance'] = points_df.apply(euclidean,axis=1)
points_df
When you apply groupby() to a DataFrame, it creates a GroupBy object, which acts as a
kind of intermediate step before applying aggregation functions or other operations to the
grouped data. This intermediate step helps you perform operations on subsets of data
based on the grouping criteria. Some common aggregation functions you can apply to a
GroupBy object include sum() , mean() , count() , max() , min() , and more.
Here's a basic example of how you can create a GroupBy object and perform aggregation
with it:
In [ ]: import pandas as pd

# sample data reconstructed to match the output shown below
data = {'Category': ['A', 'B', 'A', 'B', 'A'],
        'Value': [10, 20, 15, 25, 30]}
df = pd.DataFrame(data)
df.groupby('Category')['Value'].sum()
Category
A 55
B 45
Name: Value, dtype: int64
Practical Use
In [ ]: movies = pd.read_csv('Data\Day35\imdb-top-1000.csv')
In [ ]: movies.head(3)
[Output: the top three rows — The Shawshank Redemption (1994, Drama, 9.3) and The Godfather (1972, Crime, 9.2) among them]
In [ ]: genres.sum()
[Output: per-genre sums of the numeric columns]
In [ ]: genres.mean()
[Output: per-genre means of the numeric columns]
In [ ]: movies.groupby('Genre').sum()['Gross'].sort_values(ascending=False).head(3)
Genre
Out[ ]:
Drama 3.540997e+10
Action 3.263226e+10
Comedy 1.566387e+10
Name: Gross, dtype: float64
Genre
Out[ ]:
Western 8.35
Name: IMDB_Rating, dtype: float64
Star1
Out[ ]:
Tom Hanks 12
Robert De Niro 11
Clint Eastwood 10
Al Pacino 10
Leonardo DiCaprio 9
..
Glen Hansard 1
Giuseppe Battiston 1
Giulietta Masina 1
Gerardo Taracena 1
Ömer Faruk Sorak 1
Name: Series_Title, Length: 660, dtype: int64
In [ ]: import numpy as np
import pandas as pd
movies = pd.read_csv('Data\Day35\imdb-top-1000.csv')
In [ ]: genres = movies.groupby('Genre')
1. len
In [ ]: len(movies.groupby('Genre'))
14
Out[ ]:
2. nunique
In [ ]: movies['Genre'].nunique()
14
Out[ ]:
3. size
In [ ]: movies.groupby('Genre').size()
4. nth
In [ ]: genres = movies.groupby('Genre')
# genres.first()
# genres.last()
genres.nth(6)
[Output: the 7th film within each genre — e.g. Star Wars: Episode V (Action), Se7en (Crime), It's a Wonderful Life (Drama), WALL·E (Animation), Braveheart (Biography), Sleuth (Mystery), Get Out (Horror)]
5. get_group
In [ ]: genres.get_group('Fantasy')
[Output: the two Fantasy titles — Das Cabinet des Dr. Caligari (1920, 8.1) and Nosferatu (1922, 7.9)]
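get_group returns the same rows as an ordinary boolean filter on the grouping column; a toy sketch:

```python
import pandas as pd

df = pd.DataFrame({'Genre': ['Fantasy', 'Drama', 'Fantasy'], 'Title': ['A', 'B', 'C']})
fantasy = df.groupby('Genre').get_group('Fantasy')
print(fantasy['Title'].tolist())  # ['A', 'C'] -- same as df[df['Genre'] == 'Fantasy']
```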
6. describe
In [ ]: genres.describe()
count mean std min 25% 50% 75% max count mean ...
Genre
Action 172.0 129.046512 28.500706 45.0 110.75 127.5 143.25 321.0 172.0 7.949419 ...
Adventure 72.0 134.111111 33.317320 88.0 109.00 127.0 149.00 228.0 72.0 7.937500 ...
Animation 82.0 99.585366 14.530471 71.0 90.00 99.5 106.75 137.0 82.0 7.930488 ...
Biography 88.0 136.022727 25.514466 93.0 120.00 129.0 146.25 209.0 88.0 7.938636 ...
Comedy 155.0 112.129032 22.946213 68.0 96.00 106.0 124.50 188.0 155.0 7.901290 ...
Crime 107.0 126.392523 27.689231 80.0 106.50 122.0 141.50 229.0 107.0 8.016822 ...
Drama 289.0 124.737024 27.740490 64.0 105.00 121.0 137.00 242.0 289.0 7.957439 ...
Family 2.0 107.500000 10.606602 100.0 103.75 107.5 111.25 115.0 2.0 7.800000 ...
Fantasy 2.0 85.000000 12.727922 76.0 80.50 85.0 89.50 94.0 2.0 8.000000 ...
Film-Noir 3.0 104.000000 4.000000 100.0 102.00 104.0 106.00 108.0 3.0 7.966667 ...
Horror 11.0 102.090909 13.604812 71.0 98.00 103.0 109.00 122.0 11.0 7.909091 ...
Mystery 12.0 119.083333 14.475423 96.0 110.75 117.5 130.25 138.0 12.0 7.975000 ...
Thriller 1.0 108.000000 NaN 108.0 108.00 108.0 108.00 108.0 1.0 7.800000 ...
Western 4.0 148.250000 17.153717 132.0 134.25 148.0 162.00 165.0 4.0 8.350000 ...
14 rows × 40 columns
7. sample
In [ ]: genres.sample(2,replace=True)
[Output: two randomly chosen films per genre; replace=True allows a genre with a single film to be sampled twice]
8. nunique()
In [ ]: genres.nunique()
Genre
[Output: the number of distinct values per column within each genre]
9. agg method
In [ ]: # passing dict
genres.agg(
{
'Runtime':'mean',
'IMDB_Rating':'mean',
'No_of_Votes':'sum',
'Gross':'sum',
'Metascore':'min'
}
)
Genre
[Output: per-genre aggregates — mean Runtime, mean IMDB_Rating, summed No_of_Votes and Gross, minimum Metascore — followed by per-genre minimum and maximum rows for each column]
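The same dict-based agg pattern on toy data (invented values, not the IMDb dataset):

```python
import pandas as pd

df = pd.DataFrame({'Genre': ['Action', 'Action', 'Drama'],
                   'Runtime': [120, 100, 140],
                   'Gross': [500, 300, 200]})
# different aggregation per column: mean of Runtime, sum of Gross
out = df.groupby('Genre').agg({'Runtime': 'mean', 'Gross': 'sum'})
print(out.loc['Action', 'Runtime'])  # 110.0
print(out.loc['Action', 'Gross'])    # 800
```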
Joins In Pandas
In pandas, "joins" refer to the process of combining data from two or more DataFrames
based on a common column or index. There are several types of joins available, which
determine how rows are matched between DataFrames. Let's go into more detail about the
different types of joins and how to perform them in pandas:
1. Inner Join:
An inner join returns only the rows that have matching keys in both DataFrames.
Use the pd.merge() function with the how='inner' parameter or use the
.merge() method with the same parameter to perform an inner join.
Example:
merged_df = pd.merge(left_df, right_df, on='key', how='inner')
2. Left Join (Left Outer Join):
A left join returns all the rows from the left DataFrame and the matching rows from
the right DataFrame. Non-matching rows from the left DataFrame will also be
included.
Use the how='left' parameter with pd.merge() or .merge() to perform a
left join.
Example:
merged_df = pd.merge(left_df, right_df, on='key', how='left')
3. Right Join (Right Outer Join):
A right join is the opposite of a left join. It returns all the rows from the right
DataFrame and the matching rows from the left DataFrame. Non-matching rows
from the right DataFrame will also be included.
Use the how='right' parameter with pd.merge() or .merge() to perform a
right join.
Example:
merged_df = pd.merge(left_df, right_df, on='key', how='right')
4. Full Outer Join:
A full outer join returns all rows from both DataFrames, including both matching
and non-matching rows.
Use the how='outer' parameter with pd.merge() or .merge() to perform a
full outer join.
Example:
merged_df = pd.merge(left_df, right_df, on='key', how='outer')
5. Join on Multiple Columns:
You can perform joins on multiple columns by passing a list of column names to
the on parameter.
Example:
merged_df = pd.merge(left_df, right_df, on=['key1', 'key2'],
how='inner')
6. Join on Index:
You can join DataFrames based on their indices using the left_index and
right_index parameters set to True .
Example:
merged_df = pd.merge(left_df, right_df, left_index=True,
right_index=True, how='inner')
7. Suffixes:
If DataFrames have columns with the same name, you can specify suffixes to
differentiate them in the merged DataFrame using the suffixes parameter.
Example:
merged_df = pd.merge(left_df, right_df, on='key', how='inner',
suffixes=('_left', '_right'))
Joins in pandas are a powerful way to combine and analyze data from multiple sources. It's
important to understand the structure of your data and the requirements of your analysis to
choose the appropriate type of join. You can also use the .join() method if you want to
join DataFrames based on their indices or use pd.concat() to stack DataFrames without
performing a join based on columns or indices.
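The join types above can be compared side by side on two toy DataFrames (invented data):

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'l': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'r': ['x', 'y', 'z']})
print(len(pd.merge(left, right, on='key', how='inner')))  # 2 -- only keys 2 and 3 match
print(len(pd.merge(left, right, on='key', how='left')))   # 3 -- all left rows kept
print(len(pd.merge(left, right, on='key', how='outer')))  # 4 -- keys 1, 2, 3, 4
```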
In [ ]: import pandas as pd
import numpy as np
dec = pd.read_csv('Data\Day37\Dec.csv')
matches = pd.read_csv('Data\Day37\matches.csv')
delivery = pd.read_csv('Data\Day37\deliveries.csv')
In [ ]: students = pd.read_csv('Data\Day37\student.csv')
In [ ]: nov = pd.read_csv('Nov.csv')
Concat
In [ ]: pd.concat([nov,dec],axis=1)
0 23.0 1.0 3 5
1 15.0 5.0 16 7
2 18.0 6.0 12 10
3 23.0 4.0 12 1
4 16.0 9.0 14 9
5 18.0 1.0 7 7
6 1.0 1.0 7 2
7 7.0 8.0 16 3
8 22.0 3.0 17 10
9 15.0 1.0 11 8
10 19.0 4.0 14 6
11 1.0 6.0 12 5
12 7.0 10.0 12 7
13 11.0 7.0 18 8
14 13.0 3.0 1 10
15 24.0 4.0 1 9
16 21.0 1.0 2 5
17 16.0 5.0 7 6
18 23.0 3.0 22 5
19 17.0 7.0 22 6
20 23.0 6.0 23 9
21 25.0 1.0 23 5
22 19.0 2.0 14 4
23 25.0 10.0 14 1
24 3.0 3.0 11 10
25 NaN NaN 42 9
26 NaN NaN 50 8
27 NaN NaN 38 1
In [ ]: regs = pd.concat([nov,dec],ignore_index=True)
regs.head(2)
0 23 1
1 15 5
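pd.concat can also tag each source frame via the keys parameter, which produces a MultiIndex on the rows so you can still tell which month a registration came from. A minimal sketch, using invented stand-ins for the monthly registration tables:

```python
import pandas as pd

# Hypothetical stand-ins for the Nov/Dec registration tables
nov = pd.DataFrame({'student_id': [23, 15], 'course_id': [1, 5]})
dec = pd.DataFrame({'student_id': [18, 23], 'course_id': [6, 4]})

# keys= labels each source frame, giving a two-level row index
regs = pd.concat([nov, dec], keys=['Nov', 'Dec'])

print(regs.loc['Dec'])                           # just the December rows
print(regs.index.get_level_values(0).unique())   # ['Nov', 'Dec']
```

The outer index level (often surfaced as level_0 after reset_index) is what lets you group totals by month later.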
Inner join
In [ ]: inner = students.merge(regs,how='inner',on='student_id')
inner.head()
0 1 Kailash Harjo 23 1
1 1 Kailash Harjo 23 6
2 1 Kailash Harjo 23 10
3 1 Kailash Harjo 23 9
4 2 Esha Butala 1 5
left join
In [ ]: left = courses.merge(regs,how='left',on='course_id')
left.head()
right join
In [ ]: temp_df = pd.DataFrame({
'student_id':[26,27,28],
'name':['Nitish','Ankit','Rahul'],
'partner':[28,26,17]
})
students = pd.concat([students,temp_df],ignore_index=True)
In [ ]: students.head()
0 1 Kailash Harjo 23
1 2 Esha Butala 1
2 3 Parveen Bhalla 3
3 4 Marlo Dugal 14
4 5 Kusum Bahri 6
outer join
In [ ]: students.merge(regs,how='outer',on='student_id').tail(10)
In [ ]: total = regs.merge(courses,how='inner',on='course_id')['price'].sum()
total
Out[ ]: 154247
Out[ ]: level_0
Dec    65072
Nov    89175
Name: price, dtype: int64
Out[ ]: <Axes: xlabel='course_name'>
In [ ]: students[students['student_id'].isin(common_student_id)]
0 1 Kailash Harjo 23
2 3 Parveen Bhalla 3
6 7 Tarun Thaker 9
10 11 David Mukhopadhyay 20
15 16 Elias Dodiya 25
16 17 Yasmin Palan 7
17 18 Fardeen Mahabir 13
21 22 Yash Sethi 21
22 23 Chhavi Lachman 18
10 11 Numpy 699
11 12 C++ 1299
In [ ]: (10/28)*100
Out[ ]: 35.714285714285715
Out[ ]: student_id  name
23    Chhavi Lachman    6
7     Tarun Thaker      5
1     Kailash Harjo     4
Name: name, dtype: int64
In [ ]: regs.merge(students,on='student_id').merge(courses,on='course_id').groupby(['student_id','name'])['price'].sum().sort_values(ascending=False).head(3)
student_id name
Out[ ]:
23 Chhavi Lachman 22594
14 Pranab Natarajan 15096
19 Qabeel Raman 13498
Name: price, dtype: int64
1. MultiIndex in Series:
In a Series, a multiindex allows you to have multiple levels of row labels.
You can think of it as having subcategories or subgroups for the data in your Series.
To create a multiindex Series, you can use the pd.MultiIndex.from_tuples ,
pd.MultiIndex.from_arrays , or other constructors.
In [ ]: import numpy as np
import pandas as pd
In [ ]: # 1. pd.MultiIndex.from_tuples()
index_val = [('cse',2019),('cse',2020),('cse',2021),('cse',2022),('ece',2019),('ece',2020),('ece',2021),('ece',2022)]
multiindex = pd.MultiIndex.from_tuples(index_val)
In [ ]: multiindex
MultiIndex([('cse', 2019),
Out[ ]:
('cse', 2020),
('cse', 2021),
('cse', 2022),
('ece', 2019),
('ece', 2020),
('ece', 2021),
('ece', 2022)],
)
In [ ]: # 2. pd.MultiIndex.from_product()
pd.MultiIndex.from_product([['cse','ece'],[2019,2020,2021,2022]])
MultiIndex([('cse', 2019),
Out[ ]:
('cse', 2020),
('cse', 2021),
('cse', 2022),
('ece', 2019),
('ece', 2020),
('ece', 2021),
('ece', 2022)],
)
2019 1
Out[ ]:
2020 2
2021 3
2022 4
dtype: int64
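A minimal sketch of building a multiindex Series from the same (branch, year) tuples and selecting from it (the values here are illustrative):

```python
import pandas as pd

index_val = [('cse', 2019), ('cse', 2020), ('cse', 2021), ('cse', 2022),
             ('ece', 2019), ('ece', 2020), ('ece', 2021), ('ece', 2022)]
multiindex = pd.MultiIndex.from_tuples(index_val)

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=multiindex)

# Selecting by the outer label returns a sub-Series indexed by the inner level
print(s['cse'])         # years 2019-2022 with values 1-4
print(s['ece', 2021])   # a single scalar: 7
```

Selecting with only the outer label drops that level, which is how the year-indexed output above is produced.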
2. MultiIndex in DataFrames:
In a DataFrame, a multiindex allows you to have hierarchical row and column labels.
You can think of it as having multiple levels of row and column headers, which is useful
when dealing with multi-dimensional data.
To create a multiindex DataFrame, you can use pd.MultiIndex.from_tuples ,
pd.MultiIndex.from_arrays , or construct it directly when creating the DataFrame.
In [ ]: branch_df1 = pd.DataFrame(
[
[1,2],
[3,4],
[5,6],
[7,8],
[9,10],
[11,12],
[13,14],
[15,16],
],
index = multiindex,
columns = ['avg_package','students']
)
branch_df1
cse 2019 1 2
2020 3 4
2021 5 6
2022 7 8
ece 2019 9 10
2020 11 12
2021 13 14
2022 15 16
In [ ]: branch_df2 = pd.DataFrame(
    [
        [1,2,0,0],
        [3,4,0,0],
        [5,6,0,0],
        [7,8,0,0],
    ],
    index = [2019,2020,2021,2022],
    columns = pd.MultiIndex.from_product([['delhi','mumbai'],['avg_package','students']])
)
branch_df2
2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
In [ ]: branch_df2.loc[2019]
delhi avg_package 1
Out[ ]:
students 2
mumbai avg_package 0
students 0
Name: 2019, dtype: int64
In [ ]: branch_df3 = pd.DataFrame(
[
[1,2,0,0],
[3,4,0,0],
[5,6,0,0],
[7,8,0,0],
[9,10,0,0],
[11,12,0,0],
[13,14,0,0],
[15,16,0,0],
],
index = multiindex,
columns = pd.MultiIndex.from_product([['delhi','mumbai'],['avg_package','students']])
)
branch_df3
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
MultiIndexes allow you to represent and manipulate complex, multi-level data structures
efficiently in pandas, making it easier to work with and analyze data that has multiple
dimensions or hierarchies. You can perform various operations and selections on multiindex
objects to access and manipulate specific levels of data within your Series or DataFrame.
"Stacking" and "unstacking" are operations that you can perform on multi-indexed
DataFrames to change the arrangement of the data, essentially reshaping the data between
a wide and a long format (or vice versa).
1. Stacking:
Stacking is the process of "melting" or pivoting the innermost level of column labels to
become the innermost level of row labels.
This operation is typically used when you want to convert a wide DataFrame with multi-
level columns into a long format.
You can use the .stack() method to perform stacking. By default, it will stack the
innermost level of columns.
In [ ]: import numpy as np
import pandas as pd
A B
0 X 0.960684 0.900984
Y 0.118538 0.485585
1 X 0.946716 0.444658
Y 0.049913 0.991469
2 X 0.656110 0.759727
Y 0.158270 0.203801
3 X 0.360581 0.797212
Y 0.965035 0.102426
2. Unstacking:
Unstacking is the reverse operation of stacking. It involves pivoting the innermost level
of row labels to become the innermost level of column labels.
You can use the .unstack() method to perform unstacking. By default, it will unstack
the innermost level of row labels.
Example:
Out[ ]:          A                   B
            X         Y         X         Y
0    0.960684  0.118538  0.900984  0.485585
1    0.946716  0.049913  0.444658  0.991469
2    0.656110  0.158270  0.759727  0.203801
3    0.360581  0.965035  0.797212  0.102426
You can specify the level you want to stack or unstack by passing the level parameter to
the stack() or unstack() methods. For example:
Out[ ]: A B
0 X 0.960684 0.900984
Y 0.118538 0.485585
1 X 0.946716 0.444658
Y 0.049913 0.991469
2 X 0.656110 0.759727
Y 0.158270 0.203801
3 X 0.360581 0.797212
Y 0.965035 0.102426
Out[ ]: A B
0 1 2 3 0 1 2 3
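A small sketch of the stack/unstack round trip, including the level parameter. The frame and its values are invented, but it has the same A/B × X/Y column structure as the example above:

```python
import numpy as np
import pandas as pd

# A frame with two-level columns: (A, X), (A, Y), (B, X), (B, Y)
cols = pd.MultiIndex.from_product([['A', 'B'], ['X', 'Y']])
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=cols)

long_df = df.stack()          # innermost column level (X/Y) moves into the rows
wide_df = long_df.unstack()   # unstack() reverses it
outer = df.stack(level=0)     # level=0 stacks the outer (A/B) level instead

print(long_df.shape, wide_df.shape, outer.shape)  # (8, 2) (4, 4) (8, 2)
```

Whichever level you stack disappears from the columns and becomes the innermost row level, which is why both stacked frames have 8 rows here.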
In [ ]: index_val = [('cse',2019),('cse',2020),('cse',2021),('cse',2022),('ece',2019),('ece',2020),('ece',2021),('ece',2022)]
multiindex = pd.MultiIndex.from_tuples(index_val)
multiindex.levels[1]
In [ ]: branch_df1 = pd.DataFrame(
[
[1,2],
[3,4],
[5,6],
[7,8],
[9,10],
[11,12],
[13,14],
[15,16],
],
index = multiindex,
columns = ['avg_package','students']
)
branch_df1
cse 2019 1 2
2020 3 4
2021 5 6
2022 7 8
ece 2019 9 10
2020 11 12
2021 13 14
2022 15 16
branch_df2
2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
In [ ]: branch_df1
cse 2019 1 2
2020 3 4
2021 5 6
2022 7 8
ece 2019 9 10
2020 11 12
2021 13 14
2022 15 16
In [ ]: branch_df1.unstack().unstack()
In [ ]: branch_df1.unstack().stack()
cse 2019 1 2
2020 3 4
2021 5 6
2022 7 8
ece 2019 9 10
2020 11 12
2021 13 14
2022 15 16
In [ ]: branch_df2
2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
In [ ]: branch_df2.stack()
2019 avg_package 1 0
students 2 0
2020 avg_package 3 0
students 4 0
2021 avg_package 5 0
students 6 0
2022 avg_package 7 0
students 8 0
In [ ]: branch_df2.stack().stack()
Stacking and unstacking can be very useful when you need to reshape your data to make it
more suitable for different types of analysis or visualization. They are common operations in
data manipulation when working with multi-indexed DataFrames in pandas.
In [ ]: branch_df = pd.DataFrame(
[
[1,2,0,0],
[3,4,0,0],
[5,6,0,0],
[7,8,0,0],
[9,10,0,0],
[11,12,0,0],
[13,14,0,0],
[15,16,0,0],
],
index = multiindex,
columns = pd.MultiIndex.from_product([['delhi','mumbai'],['avg_package','students']])
)
branch_df
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
Basic Checks
In [ ]: # HEAD
branch_df.head()
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
In [ ]: # Tail
branch_df.tail()
cse 2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
In [ ]: #shape
branch_df.shape
(8, 4)
Out[ ]:
In [ ]: # info
branch_df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 8 entries, ('cse', 2019) to ('ece', 2022)
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (delhi, avg_package) 8 non-null int64
1 (delhi, students) 8 non-null int64
2 (mumbai, avg_package) 8 non-null int64
3 (mumbai, students) 8 non-null int64
dtypes: int64(4)
memory usage: 632.0+ bytes
In [ ]: # duplicated
branch_df.duplicated().sum()
0
Out[ ]:
In [ ]: # isnull
branch_df.isnull().sum()
delhi avg_package 0
Out[ ]:
students 0
mumbai avg_package 0
students 0
dtype: int64
How to Extract
In [ ]: branch_df.loc[('cse',2022)]
Out[ ]: delhi   avg_package    7
        students       8
mumbai  avg_package    0
        students       0
Name: (cse, 2022), dtype: int64
In [ ]: branch_df
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
In [ ]: branch_df.loc[('cse',2021):('ece',2021)]
cse 2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
In [ ]: # using iloc
branch_df.iloc[2:5]
cse 2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
In [ ]: branch_df.iloc[2:8:2]
cse 2021 5 6 0 0
ece 2019 9 10 0 0
2021 13 14 0 0
In [ ]: # extacting cols
branch_df['delhi']['students']
cse 2019 2
Out[ ]:
2020 4
2021 6
2022 8
ece 2019 10
2020 12
2021 14
2022 16
Name: students, dtype: int64
In [ ]: branch_df.iloc[:,1:3]
students avg_package
cse 2019 2 0
2020 4 0
2021 6 0
2022 8 0
ece 2019 10 0
2020 12 0
2021 14 0
2022 16 0
In [ ]: # Extracting both
branch_df.iloc[[0,4],[1,2]]
students avg_package
cse 2019 2 0
ece 2019 10 0
Sorting
In [ ]: branch_df.sort_index(ascending=False)
ece 2022 15 16 0 0
2021 13 14 0 0
2020 11 12 0 0
2019 9 10 0 0
cse 2022 7 8 0 0
2021 5 6 0 0
2020 3 4 0 0
2019 1 2 0 0
In [ ]: branch_df.sort_index(ascending=[False,True])
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
In [ ]: branch_df.sort_index(level=0,ascending=[False])
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
In [ ]: # transpose
branch_df.transpose()
delhi avg_package 1 3 5 7 9 11 13 15
students 2 4 6 8 10 12 14 16
mumbai avg_package 0 0 0 0 0 0 0 0
students 0 0 0 0 0 0 0 0
In [ ]: # swaplevel
branch_df.swaplevel(axis=1)
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
In [ ]: branch_df.swaplevel()
2019 cse 1 2 0 0
2020 cse 3 4 0 0
2021 cse 5 6 0 0
2022 cse 7 8 0 0
2019 ece 9 10 0 0
2020 ece 11 12 0 0
2021 ece 13 14 0 0
2022 ece 15 16 0 0
"Long" and "wide" are terms often used in data analysis and data reshaping in the context of
data frames or tables, typically in software like R or Python. They describe two different ways
of organizing and structuring data.
Long Format:
ID Variable Value
1 Age 25
1 Height 175
1 Weight 70
2 Age 30
2 Height 160
2 Weight 60
Wide Format:
ID Age Height Weight
1 25 175 70
2 30 160 60
Converting data between long and wide formats is often necessary depending on the
specific analysis or visualization tasks you want to perform. In software like R and Python,
there are functions and libraries available for reshaping data between these formats: gather
(or pivot_longer ) in R's tidyr and melt in pandas move data from wide to long
format, while spread (or pivot_wider ) in tidyr and pivot in pandas move data
from long to wide format.
In [ ]: import numpy as np
import pandas as pd
In [ ]: pd.DataFrame({'cse':[120]})
Out[ ]: cse
0 120
In [ ]: pd.DataFrame({'cse':[120]}).melt()
0 cse 120
In [ ]: pd.DataFrame({'cse':[120],'ece':[100],'mech':[50]})
Out[ ]:   cse  ece  mech
0   120  100    50
In [ ]: pd.DataFrame({'cse':[120],'ece':[100],'mech':[50]}).melt()
Out[ ]:   variable  value
0      cse    120
1      ece    100
2      mech    50
In [ ]: pd.DataFrame(
{
'branch':['cse','ece','mech'],
'2020':[100,150,60],
'2021':[120,130,80],
'2022':[150,140,70]
}
)
2 mech 60 80 70
In [ ]: pd.DataFrame(
{
'branch':['cse','ece','mech'],
'2020':[100,150,60],
'2021':[120,130,80],
'2022':[150,140,70]
}
).melt()
0 branch cse
1 branch ece
2 branch mech
3 2020 100
4 2020 150
5 2020 60
6 2021 120
7 2021 130
8 2021 80
9 2022 150
10 2022 140
11 2022 70
2 mech 2020 60
5 mech 2021 80
8 mech 2022 70
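The id_vars behaviour above can be sketched end to end, including reversing the melt with pivot(). The column names mirror the example table; var_name/value_name are illustrative choices:

```python
import pandas as pd

# Hypothetical frame matching the branch/year table above
df = pd.DataFrame({
    'branch': ['cse', 'ece', 'mech'],
    '2020': [100, 150, 60],
    '2021': [120, 130, 80],
    '2022': [150, 140, 70],
})

# id_vars keeps 'branch' as an identifier column instead of melting it
long_df = df.melt(id_vars=['branch'], var_name='year', value_name='students')
print(long_df.head())

# pivot() reverses the melt, going back from long to wide
wide_df = long_df.pivot(index='branch', columns='year', values='students')
print(wide_df)
```

Each (branch, year) pair becomes one row in the long frame, so 3 branches × 3 years gives 9 rows.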
Real-World Example:
In the context of COVID-19 data, data for deaths and confirmed cases are initially stored
in wide formats.
The data is converted to long format, making it easier to conduct analyses.
In the long format, each row represents a specific location, date, and the corresponding
number of deaths or confirmed cases. This format allows for efficient merging and
analysis, as it keeps related data in one place and facilitates further data exploration.
In [ ]: death.head()
Out[ ]: Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/2
In [ ]: confirm.head()
Out[ ]: Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/2
In [ ]: death = death.melt(id_vars=['Province/State','Country/Region','Lat','Long'],var_name='date',value_name='num_deaths')
confirm = confirm.melt(id_vars=['Province/State','Country/Region','Lat','Long'],var_name='date',value_name='num_cases')
In [ ]: death.head()
In [ ]: confirm.head()
In [ ]: confirm.merge(death,on=['Province/State','Country/Region','Lat','Long','date'])
311249   NaN   Winter Olympics 2022   39.904200   116.407400   1/2/23   535   0
In [ ]: confirm.merge(death,on=['Province/State','Country/Region','Lat','Long','date'])[['Country/Region','date','num_cases','num_deaths']].head()
0 Afghanistan 1/22/20 0 0
1 Albania 1/22/20 0 0
2 Algeria 1/22/20 0 0
3 Andorra 1/22/20 0 0
4 Angola 1/22/20 0 0
The choice between long and wide data formats depends on the nature of the dataset and
the specific analysis or visualization tasks you want to perform. Converting data between
these formats can help optimize data organization for different analytical needs.
In [ ]: import numpy as np
import pandas as pd
import seaborn as sns
In [ ]: # Sample data
data = {
'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'Product': ['A', 'B', 'A', 'B'],
'Sales': [100, 200, 150, 250],
}
In [ ]: df = pd.DataFrame(data)
pivot_table = pd.pivot_table(df, values='Sales', index='Date', columns='Product', aggfunc='sum')
print(pivot_table)
Product A B
Date
2023-01-01 100 200
2023-01-02 150 250
In this example, we first create a DataFrame using the sample data. Then, we use the
pd.pivot_table function to create a pivot table. Here's what each argument does:
values names the column to aggregate, index supplies the row labels, columns
supplies the column labels, and aggfunc sets the aggregation function applied to each
group (e.g. 'sum' or 'mean' ).
Real-world Examples
In [ ]: df = sns.load_dataset('tips')
In [ ]: df.head()
In [ ]: df.pivot_table(index='sex',columns='smoker',values='total_bill')
sex
In [ ]: # aggfunc
df.pivot_table(index='sex',columns='smoker',values='total_bill',aggfunc='std')
sex
sex
In [ ]: # multidimensional
df.pivot_table(index=['sex','smoker'],columns=['day','time'],aggfunc={'size':'mean'})
Out[ ]: size
time Lunch Dinner Lunch Dinner Dinner Dinner Lunch Dinner Lunch D
sex smoker
Male Yes 2.300000 NaN 1.666667 2.4 2.629630 2.600000 5.00 NaN 2.20
Female Yes 2.428571 NaN 2.000000 2.0 2.200000 2.500000 5.00 NaN 3.48
In [ ]: # margins
df.pivot_table(index='sex',columns='smoker',values='total_bill',aggfunc='sum',margins=True)
sex
Plotting graph
In [ ]: df = pd.read_csv(r'Data\Day43\expense_data.csv')
In [ ]: df.head()
Out[ ]: Date Account Category Subcategory Note INR Income/Expense Note.1 Amou
CUB -
3/2/2022
0 online Food NaN Brownie 50.0 Expense NaN 50
10:11
payment
CUB - To
3/2/2022
1 online Other NaN lended 300.0 Expense NaN 300
10:11
payment people
CUB -
3/1/2022
2 online Food NaN Dinner 78.0 Expense NaN 78
19:50
payment
CUB -
3/1/2022
3 online Transportation NaN Metro 30.0 Expense NaN 30
18:56
payment
CUB -
3/1/2022
4 online Food NaN Snacks 67.0 Expense NaN 67
18:22
payment
In [ ]: df['Category'].value_counts()
In [ ]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277 entries, 0 to 276
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 277 non-null object
1 Account 277 non-null object
2 Category 277 non-null object
3 Subcategory 0 non-null float64
4 Note 273 non-null object
5 INR 277 non-null float64
6 Income/Expense 277 non-null object
7 Note.1 0 non-null float64
8 Amount 277 non-null float64
9 Currency 277 non-null object
10 Account.1 277 non-null float64
dtypes: float64(5), object(6)
memory usage: 23.9+ KB
In [ ]: df['Date'] = pd.to_datetime(df['Date'])
In [ ]: df['month'] = df['Date'].dt.month_name()
In [ ]: df.head()
Out[ ]: Date Account Category Subcategory Note INR Income/Expense Note.1 Amoun
2022- CUB -
0 03-02 online Food NaN Brownie 50.0 Expense NaN 50
10:11:00 payment
2022- CUB - To
1 03-02 online Other NaN lended 300.0 Expense NaN 300
10:11:00 payment people
2022- CUB -
2 03-01 online Food NaN Dinner 78.0 Expense NaN 78
19:50:00 payment
2022- CUB -
3 03-01 online Transportation NaN Metro 30.0 Expense NaN 30
18:56:00 payment
2022- CUB -
4 03-01 online Food NaN Snacks 67.0 Expense NaN 67
18:22:00 payment
In [ ]: df.pivot_table(index='month',columns='Income/Expense',values='INR',aggfunc='sum',fill_value=0).plot()
<Axes: xlabel='month'>
Out[ ]:
In [ ]: df.pivot_table(index='month',columns='Account',values='INR',aggfunc='sum',fill_value=0).plot()
<Axes: xlabel='month'>
Out[ ]:
Vectorized string operations in Pandas refer to the ability to apply string functions and
operations to entire arrays of strings (columns or Series containing strings) without the
need for explicit loops or iteration. This is made possible by Pandas' integration with the
NumPy library, which allows for efficient element-wise operations.
When you have a Pandas DataFrame or Series containing string data, you can use
various string methods that are applied to every element in the column simultaneously.
This can significantly improve the efficiency and readability of your code. Some of the
commonly used vectorized string operations in Pandas include methods like
.str.lower() , .str.upper() , .str.strip() , .str.replace() , and many
more.
Vectorized string operations not only make your code more concise and readable but
also often lead to improved performance compared to explicit for-loops, especially
when dealing with large datasets.
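A short sketch of chaining vectorized string methods on an invented Series. Note how a missing value (None) propagates as NaN through the .str accessor instead of raising an error:

```python
import pandas as pd

# Illustrative Series; None shows how missing values propagate through .str
names = pd.Series(['  Alice smith', 'BOB jones', None, 'carol WHITE'])

cleaned = names.str.strip().str.title()            # '  Alice smith' -> 'Alice Smith'
lengths = names.str.len()                          # NaN for the missing entry
has_smith = names.str.contains('smith', case=False)

print(cleaned.tolist())
```

Because the whole column is processed at once, there is no explicit loop, and the same chain works unchanged on a DataFrame column of any length.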
In [ ]: import numpy as np
import pandas as pd
In [ ]: s = pd.Series(['cat','mat',None,'rat'])
0 True
Out[ ]:
1 False
2 None
3 False
dtype: object
In [ ]: df.head()
Out[ ]: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
Braund,
A/5
0 1 0 3 Mr. Owen male 22.0 1 0 7.2500 NaN
21171
Harris
Cumings,
Mrs. John
Bradley
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85
(Florence
Briggs
Th...
Heikkinen,
STON/O2.
2 3 1 3 Miss. female 26.0 0 0 7.9250 NaN
3101282
Laina
Futrelle,
Mrs.
Jacques
3 4 1 1 female 35.0 1 0 113803 53.1000 C123
Heath
(Lily May
Peel)
Allen, Mr.
4 5 0 3 William male 35.0 0 0 373450 8.0500 NaN
Henry
In [ ]: df['Name']
Common Functions
In [ ]: # lower/upper/capitalize/title
df['Name'].str.upper()
df['Name'].str.capitalize()
df['Name'].str.title()
In [ ]: # len
df['Name'].str.len().max()
In [ ]: df['Name'][df['Name'].str.len() == 82].values[0]
'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallej
Out[ ]:
o)'
In [ ]: # strip
df['Name'].str.strip()
Out[ ]: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
Braund,
A/5
0 1 0 3 Mr. Owen male 22.0 1 0 7.2500 NaN
21171
Harris
Cumings,
Mrs. John
Bradley
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85
(Florence
Briggs
Th...
Heikkinen,
STON/O2.
2 3 1 3 Miss. female 26.0 0 0 7.9250 NaN
3101282
Laina
Futrelle,
Mrs.
Jacques
3 4 1 1 female 35.0 1 0 113803 53.1000 C123
Heath
(Lily May
Peel)
Allen, Mr.
4 5 0 3 William male 35.0 0 0 373450 8.0500 NaN
Henry
In [ ]: df[['title','firstname']] = df['Name'].str.split(',').str.get(1).str.strip().str.split(' ', n=1, expand=True)
df.head()
Out[ ]: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
Braund,
A/5
0 1 0 3 Mr. Owen male 22.0 1 0 7.2500 NaN
21171
Harris
Cumings,
Mrs. John
Bradley
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85
(Florence
Briggs
Th...
Heikkinen,
STON/O2.
2 3 1 3 Miss. female 26.0 0 0 7.9250 NaN
3101282
Laina
Futrelle,
Mrs.
Jacques
3 4 1 1 female 35.0 1 0 113803 53.1000 C123
Heath
(Lily May
Peel)
Allen, Mr.
4 5 0 3 William male 35.0 0 0 373450 8.0500 NaN
Henry
In [ ]: df['title'].value_counts()
In [ ]: # replace
df['title'] = df['title'].str.replace('Ms.','Miss.')
df['title'] = df['title'].str.replace('Mlle.','Miss.')
In [ ]: df['title'].value_counts()
title
Out[ ]:
Mr. 517
Miss. 185
Mrs. 125
Master. 40
Dr. 7
Rev. 6
Major. 2
Col. 2
Don. 1
Mme. 1
Lady. 1
Sir. 1
Capt. 1
the 1
Jonkheer. 1
Name: count, dtype: int64
filtering
In [ ]: # startswith/endswith
df[df['firstname'].str.endswith('A')]
Out[ ]: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
Stewart,
Mr. PC
64 65 0 1 male NaN 0 0 27.7208 NaN
Albert 17605
A
Keane,
303 304 1 2 Miss. female NaN 0 0 226593 12.3500 E101
Nora A
In [ ]: # isdigit/isalpha...
df[df['firstname'].str.isdigit()]
Out[ ]: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
slicing
In [ ]: df['Name'].str[::-1]
In Pandas, you can work with dates and times using the datetime data type. Pandas
provides several data structures and functions for handling date and time data, making it
convenient for time series data analysis.
In [ ]: import numpy as np
import pandas as pd
1. Timestamp :
This represents a single timestamp and is the fundamental data type for time series data in
Pandas.
Time stamps reference particular moments in time (e.g., Oct 24th, 2022 at 7:00pm)
In [ ]: # creating a Timestamp
pd.Timestamp('2023/1/5')
Out[ ]: Timestamp('2023-01-05 00:00:00')
In [ ]: # variations
pd.Timestamp('2023-1-5')
pd.Timestamp('2023, 1, 5')
Timestamp('2023-01-05 00:00:00')
Out[ ]:
In [ ]: # only year
pd.Timestamp('2023')
Timestamp('2023-01-01 00:00:00')
Out[ ]:
In [ ]: # using text
pd.Timestamp('5th January 2023 9:21AM')
Out[ ]: Timestamp('2023-01-05 09:21:00')
In [ ]: # using a datetime object
import datetime as dt
x = pd.Timestamp(dt.datetime(2023,1,5,9,21,56))
x
Timestamp('2023-01-05 09:21:56')
Out[ ]:
In [ ]: # fetching attributes
x.year
2023
Out[ ]:
In [ ]: x.month
1
Out[ ]:
In [ ]: x.day
x.hour
x.minute
x.second
56
Out[ ]:
1. Efficiency: The datetime module in Python is flexible and comprehensive, but it may
not be as efficient when dealing with large datasets. Pandas' datetime objects are
optimized for performance and are designed for working with data, making them more
suitable for operations on large time series datasets.
2. Data Alignment: Pandas focuses on data manipulation and analysis, so it provides tools
for aligning data with time-based indices and working with irregular time series. This is
particularly useful in financial and scientific data analysis.
3. Convenience: Pandas provides a high-level API for working with time series data, which
can make your code more concise and readable. It simplifies common operations such
as resampling, aggregation, and filtering.
4. Integration with DataFrames: Pandas seamlessly integrates its date and time objects
with DataFrames. This integration allows you to easily create, manipulate, and analyze
time series data within the context of your data analysis tasks.
5. Time Zones: Pandas has built-in support for handling time zones and daylight saving
time, making it more suitable for working with global datasets and international time
series data.
2. DatetimeIndex :
This is an index that consists of Timestamp objects. It is used to create time series data in
Pandas DataFrames.
In [ ]: # from strings
pd.DatetimeIndex(['2023/1/1','2022/1/1','2021/1/1'])
In [ ]: # from strings
type(pd.DatetimeIndex(['2023/1/1','2022/1/1','2021/1/1']))
pandas.core.indexes.datetimes.DatetimeIndex
Out[ ]:
In [ ]: # using pd.timestamps
dt_index = pd.DatetimeIndex([pd.Timestamp(2023,1,1),pd.Timestamp(2022,1,1),pd.Timestamp(2021,1,1)])
In [ ]: dt_index
pd.Series([1,2,3],index=dt_index)
3. date_range function
In [ ]: # generate daily dates in a given range
pd.date_range(start='2023/1/5',end='2023/2/28',freq='D')
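A few common freq values, sketched on invented date ranges. The specific dates and periods here are arbitrary:

```python
import pandas as pd

# Different freq strings change the spacing of the generated dates
daily = pd.date_range(start='2023-01-01', end='2023-01-07', freq='D')
monthly = pd.date_range(start='2023-01-01', periods=6, freq='MS')   # month starts
every_2d = pd.date_range(start='2023-01-01', periods=4, freq='2D')  # every 2 days

print(len(daily))                  # 7 daily timestamps
print(monthly[0], monthly[-1])
print(every_2d[-1])
```

You can supply either start/end or start/periods; pandas fills in the rest from the frequency.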
4. to_datetime function
converts an existing objects to pandas timestamp/datetimeindex object
In [ ]: s = pd.Series(['2023/1/1','2022/1/1','2021/1/1'])
pd.to_datetime(s).dt.day_name()
0 Sunday
Out[ ]:
1 Saturday
2 Friday
dtype: object
In [ ]: # with errors
s = pd.Series(['2023/1/1','2022/1/1','2021/130/1'])
pd.to_datetime(s,errors='coerce').dt.month_name()
0 January
Out[ ]:
1 January
2 NaN
dtype: object
In [ ]: df = pd.read_csv(r'Data\Day43\expense_data.csv')
df.shape
(277, 11)
Out[ ]:
In [ ]: df.head()
Out[ ]: Date Account Category Subcategory Note INR Income/Expense Note.1 Amou
CUB -
3/2/2022
0 online Food NaN Brownie 50.0 Expense NaN 50
10:11
payment
CUB - To
3/2/2022
1 online Other NaN lended 300.0 Expense NaN 300
10:11
payment people
CUB -
3/1/2022
2 online Food NaN Dinner 78.0 Expense NaN 78
19:50
payment
CUB -
3/1/2022
3 online Transportation NaN Metro 30.0 Expense NaN 30
18:56
payment
CUB -
3/1/2022
4 online Food NaN Snacks 67.0 Expense NaN 67
18:22
payment
In [ ]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277 entries, 0 to 276
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 277 non-null object
1 Account 277 non-null object
2 Category 277 non-null object
3 Subcategory 0 non-null float64
4 Note 273 non-null object
5 INR 277 non-null float64
6 Income/Expense 277 non-null object
7 Note.1 0 non-null float64
7 Note.1 0 non-null float64
8 Amount 277 non-null float64
9 Currency 277 non-null object
10 Account.1 277 non-null float64
dtypes: float64(5), object(6)
memory usage: 23.9+ KB
In [ ]: df['Date'] = pd.to_datetime(df['Date'])
In [ ]: df.info()
5. dt accessor
Accessor object for datetimelike properties of the Series values.
In [ ]: df['Date'].dt.is_quarter_start
0 False
Out[ ]:
1 False
2 False
3 False
4 False
...
272 False
273 False
274 False
275 False
276 False
Name: Date, Length: 277, dtype: bool
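Beyond is_quarter_start, the dt accessor exposes many element-wise attributes. A sketch on an invented Series of dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2023-01-01', '2023-04-15', '2023-12-31']))

# The .dt accessor exposes datetime attributes element-wise
print(dates.dt.year.tolist())        # [2023, 2023, 2023]
print(dates.dt.day_name().tolist())  # day-of-week names
print(dates.dt.quarter.tolist())     # [1, 2, 4]
```

Each call returns a Series aligned with the original, so the results can be assigned straight back as new columns, like the month column created above.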
In [ ]: # plot graph
import matplotlib.pyplot as plt
plt.plot(df['Date'],df['INR'])
[<matplotlib.lines.Line2D at 0x181faeba430>]
Out[ ]:
In [ ]: df['day_name'] = df['Date'].dt.day_name()
In [ ]: df.head()
Out[ ]: Date Account Category Subcategory Note INR Income/Expense Note.1 Amoun
2022- CUB -
0 03-02 online Food NaN Brownie 50.0 Expense NaN 50
10:11:00 payment
2022- CUB - To
1 03-02 online Other NaN lended 300.0 Expense NaN 300
10:11:00 payment people
2022- CUB -
2 03-01 online Food NaN Dinner 78.0 Expense NaN 78
19:50:00 payment
2022- CUB -
3 03-01 online Transportation NaN Metro 30.0 Expense NaN 30
18:56:00 payment
2022- CUB -
4 03-01 online Food NaN Snacks 67.0 Expense NaN 67
18:22:00 payment
In [ ]: df.groupby('day_name')['INR'].mean().plot(kind='bar')
<Axes: xlabel='day_name'>
Out[ ]:
In [ ]: df['month_name'] = df['Date'].dt.month_name()
In [ ]: df.groupby('month_name')['INR'].sum().plot(kind='bar')
<Axes: xlabel='month_name'>
Out[ ]:
Pandas also provides powerful time series functionality, including the ability to resample,
group, and perform various time-based operations on data. You can work with date and
time data in Pandas to analyze and manipulate time series data effectively.
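As a closing sketch of that resampling functionality, here is a minimal example that aggregates a daily Series into weekly bins. The series itself is invented:

```python
import numpy as np
import pandas as pd

# Two weeks of daily values, indexed by date
idx = pd.date_range('2023-01-01', periods=14, freq='D')
ts = pd.Series(np.arange(14), index=idx)

# resample('W') groups the days into weekly bins; .sum() aggregates each bin
weekly = ts.resample('W').sum()
print(weekly)
```

Swapping .sum() for .mean(), .max(), or .count() changes only the aggregation, and the 'W' rule can be replaced by 'MS', 'QE', and other frequency strings.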