Panda data structures and its importance in Python.pdf
1.
Sanjivani Rural Education Society's
Sanjivani College of Engineering, Kopargaon 423603.
-Department of Structural Engineering-
Course Title: (Python-SY & TY B.TECH Structure)
Pandas
By
Mr. Sumit S. Kolapkar (Assistant Professor)
Mail Id- [email protected]
2.
➢ What is Data Analysis-
• Is a process of inspecting, cleansing, transforming
and modelling data with the goal of discovering useful
information, informing conclusions and supporting
decision making.
➢ Python Libraries for Data Analysis-
3.
➢ What is Pandas-
• The name refers to both "panel data" and "Python
data analysis"; it was developed by Wes McKinney in
2008.
• Used for working with data sets.
• It has functions for analysing, cleaning, exploring and
manipulating data.
• Reads and writes data in different formats like CSV, Zip,
text, JSON etc.
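Example- A minimal sketch of the read/write point above. It round-trips a small table through CSV and JSON text held in memory. The column names and values are made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"member": ["B1", "B2", "B3"], "span_m": [3.0, 4.5, 6.0]})

# Write to CSV text (no index column), then read it back
csv_text = df.to_csv(index=False)
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# Write to JSON text (one record per row), then read it back
json_text = df.to_json(orient="records")
df_from_json = pd.read_json(io.StringIO(json_text))

print(df_from_csv)
print(df_from_json)
```

Passing a file name such as "data.csv" instead of the in-memory buffer reads from or writes to disk in the same way.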
4.
➢ Excel vs. Pandas-
• Pandas shines for analysis work, whereas Excel
shines for building small applications and
presentations.
• Excel cannot handle more than 1,048,576 (about 1.05
million) rows per worksheet, and many datasets today
have 2 million rows or more.
• Pandas/Python is also more powerful than Excel for
this kind of work. You can work with more data, faster,
and automate a lot more.
5.
➢ Importance of Pandas in Python-
• Pandas allows us to analyze big data and make
conclusions based on statistical theories.
• Pandas can clean messy data sets, and make them
readable and relevant.
• Easy handling of missing data (represented as NaN)
in both floating point and non-floating point data.
• Size mutability: columns can be inserted and deleted
from DataFrames and higher-dimensional objects.
• Data set merging and joining. Flexible reshaping and
pivoting of data sets.
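Example- A small sketch of three of the points above: missing data, size mutability and merging. The tables, column names and values are hypothetical. Filling the gap with the column mean is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing value (NaN)
readings = pd.DataFrame({"sample": ["S1", "S2", "S3"],
                         "strength": [25.0, np.nan, 30.0]})

# Missing data: count the gaps, then fill them with the column mean
n_missing = int(readings["strength"].isna().sum())
filled = readings.fillna({"strength": readings["strength"].mean()})

# Size mutability: add a column, then remove it again
filled["checked"] = True
filled = filled.drop(columns="checked")

# Merging: join a second table on the shared "sample" column
grades = pd.DataFrame({"sample": ["S1", "S2", "S3"],
                       "grade": ["M25", "M25", "M30"]})
merged = filled.merge(grades, on="sample")

print(n_missing)
print(merged)
```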
6.
➢ Pandas data structures- Three types
• Series- One dimensional labelled array capable
of holding data of any type (integer, string, float etc).
pd.Series(data)
• Data frames- Two dimensional data structure with
rows and columns, just like a table.
• Panel- A 3D container of data (deprecated and removed
in newer Pandas versions).
➢ Installing Pandas-
• pip install pandas
➢ Importing Pandas-
• import pandas as pd
7.
➢ Series data structure-
• Is a one dimensional array which is capable of storing
various data types.
• Syntax: pandas.Series(data=None, index=None,
dtype=None, name=None, copy=False,
fastpath=False)
• Parameters:
• data: array- Contains data stored in Series.
• index: array-like or Index (1d)
• dtype: str, numpy.dtype, float, or ExtensionDtype,
optional
• name: str, optional
• copy: bool, default False
8.
➢ Series data structure-
• import pandas as pd
a=pd.Series( )
print(a)
• Example-
import pandas as pd
X = [3,4,5,6,7,8]
var = pd.Series(X)
print(var)
print(var[4])
OUTPUT
0 3
1 4
2 5
3 6
4 7
5 8
dtype: int64
7...value of index 4
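Example- In the code above, `var[4]` looks the value up by index label. The sketch below contrasts label-based access (`.loc`) with position-based access (`.iloc`). The second Series and its index values are made up to show where the two differ.

```python
import pandas as pd

var = pd.Series([3, 4, 5, 6, 7, 8])

# With the default index, label 4 and position 4 are the same element
by_label = var.loc[4]      # label-based lookup
by_position = var.iloc[4]  # position-based lookup

# With a custom index, labels and positions no longer coincide
shifted = pd.Series([3, 4, 5, 6, 7, 8], index=[10, 11, 12, 13, 14, 15])
first_by_position = shifted.iloc[0]  # first element
first_by_label = shifted.loc[10]     # element labelled 10

print(by_label, by_position, first_by_position, first_by_label)
```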
9.
➢ Series data structure-
Example-
import pandas as pd
dic={"name":['python','c','c++','java'],"popularity":[90,65,70,85], "rank":[1.0,4.0,3.0,2.0]}
var=pd.Series(dic)
print(var)
• OUTPUT
name [python, c, c++, java]
popularity [90, 65, 70, 85]
rank [1.0, 4.0, 3.0, 2.0]
dtype: object........because each element of the Series is a whole
Python list (of strings, integers or floats), not a single value.
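Example- For comparison, a sketch of a Series built from a dictionary of single values rather than lists. The keys become the index labels. The scores are reused from the example above purely for illustration.

```python
import pandas as pd

# Each key becomes an index label, each value one element
popularity = pd.Series({"python": 90, "c": 65, "c++": 70, "java": 85})

print(popularity)
print(popularity["c++"])   # look up by label
print(popularity.dtype)    # integer dtype, since every value is an integer
```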
10.
➢ Series data structure- Change the index
• import pandas as pd
a=pd.Series( )
print(a)
• Example-
import pandas as pd
X = [3,4,5,6]
var = pd.Series(X, index=["a", "s", "d", "f"])
print(var)
OUTPUT
a 3
s 4
d 5
f 6
dtype: int64
11.
➢ Series data structure- Change the index, dtype and name
• import pandas as pd
a=pd.Series( )
print(a)
• Example-
import pandas as pd
X = [3,4,5,6]
var = pd.Series(X, index=["a", "s", "d", "f"], dtype="float", name="python")
print(var)
OUTPUT
a 3.0
s 4.0
d 5.0
f 6.0
Name: python, dtype: float64
12.
➢ Series data structure- Change the index
Example-
import pandas as pd
X = pd.Series(12,index=[1,2,3,4,5,6,7])
print(X)
OUTPUT
1 12
2 12
3 12
4 12
5 12
6 12
7 12
dtype: int64
13.
➢ Series data structure- Change the index
Example-
import pandas as pd
X1 = pd.Series(12,index=[1,2,3,4,5,6,7])
X2 = pd.Series(12,index=[1,2,3,4])
print(X1+X2)
OUTPUT
1 24.0
2 24.0
3 24.0
4 24.0
5 NaN
6 NaN
7 NaN
dtype: float64
Note- 1. Adding NumPy arrays of lengths 7 and 4 raises a broadcasting
error, whereas Pandas aligns the two Series on their index labels and
gives NaN for labels present in only one of them.
2. Pandas can work with missing data.
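Example- If NaN is not wanted for the unmatched labels, one option is `Series.add` with `fill_value`. The sketch below reuses X1 and X2 from the slide. Treating the missing entries as 0 is an assumption made for illustration.

```python
import pandas as pd

X1 = pd.Series(12, index=[1, 2, 3, 4, 5, 6, 7])
X2 = pd.Series(12, index=[1, 2, 3, 4])

# Plain "+" aligns on labels and leaves NaN where X2 has no label
total = X1 + X2

# Series.add with fill_value=0 treats the missing labels in X2 as 0
total_filled = X1.add(X2, fill_value=0)

print(total_filled)
```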
14.
➢ Data frames in Pandas-
DataFrame: 1. Is a two-dimensional, size-mutable,
heterogeneous tabular data structure with labeled axes
(rows and columns).
2. Pandas DataFrame consists of three principal
components: the data, rows, and columns.
3. Pandas DataFrame can be created from lists,
a dictionary, a list of dictionaries etc.
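Example- A sketch of the list-of-dictionaries case mentioned in point 3. The records are hypothetical. A key that is missing from one dictionary shows up as NaN in that row.

```python
import pandas as pd

# Hypothetical records; each dictionary becomes one row
rows = [
    {"name": "python", "rank": 1},
    {"name": "java", "rank": 2},
    {"name": "c++"},            # "rank" is absent here, so pandas stores NaN
]
df = pd.DataFrame(rows)

print(df)
print(df.shape)  # (rows, columns)
```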
15.
➢ Data frames in Pandas- Using List
import pandas as pd
L = [1,2,3,4,5,6,7]
var = pd.DataFrame(L)
print(var)
OUTPUT
0
0 1
1 2
2 3
3 4
4 5
5 6
6 7
16.
➢ Data frames in Pandas- Using Dictionary
import pandas as pd
dic = {"a":[1,2,3,4,5],"b":[1,2,3,4,5]}
var = pd.DataFrame(dic)
print(var)
OUTPUT
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
17.
➢ Data frames in Pandas- To work on column
import pandas as pd
dic = {"a":[1,2,3,4,5],"b":[1,2,3,4,5]}
var = pd.DataFrame(dic,columns=["a"])
print(var)
OUTPUT
a
0 1
1 2
2 3
3 4
4 5
18.
➢ Data frames in Pandas- To work on column
import pandas as pd
dic = {"a":[1,2,3,4,5],"b":[1,2,3,4,5],1:[1,2,3,4,5]}
var = pd.DataFrame(dic,columns=["a",1])
print(var)
OUTPUT
a 1
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
19.
➢ Data frames in Pandas- To work on column
import pandas as pd
dic = {"a":[1,2,3,4,5],"b":[1,2,3,4,5],1:[1,2,3,4,5]}
var=pd.DataFrame(dic,columns=["a",1],index=["s","u","m","i","t"])
print(var)
OUTPUT
a 1
s 1 1
u 2 2
m 3 3
i 4 4
t 5 5
20.
➢ Data frames in Pandas- To work on column
import pandas as pd
dic = {"a":[1,2,3,4,5],"b":[1,2,3,14,5],1:[1,2,3,4,5]}
var = pd.DataFrame(dic)
print(var)
print(var["b"][3])
OUTPUT
a b 1
0 1 1 1
1 2 2 2
2 3 3 3
3 4 14 4
4 5 5 5
14
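Example- `var["b"][3]` first selects the column and then the row. As a sketch, the same cell can also be reached in one step with `.loc`, `.at` or `.iloc`. Assigning through `.loc` changes the DataFrame itself, and the replacement value 4 is chosen only for illustration.

```python
import pandas as pd

dic = {"a": [1, 2, 3, 4, 5], "b": [1, 2, 3, 14, 5], 1: [1, 2, 3, 4, 5]}
var = pd.DataFrame(dic)

# Single-step lookups: row label 3, column "b"
value_loc = var.loc[3, "b"]   # label-based
value_at = var.at[3, "b"]     # label-based, single values only
value_iloc = var.iloc[3, 1]   # position-based: 4th row, 2nd column

# Assignment through .loc modifies the DataFrame itself
var.loc[3, "b"] = 4

print(value_loc, value_at, value_iloc)
print(var["b"].tolist())
```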
➢ Data frames in Pandas- Using Series
import pandas as pd
sr ={"a":pd.Series([1,2,3,4]),"b":pd.Series([11,12,13,14])}
var = pd.DataFrame(sr)
print(var)
OUTPUT
a b
0 1 11
1 2 12
2 3 13
3 4 14
23.
➢ Arithmetic Operations in Pandas-
import pandas as pd
sr = {"A":[1,2,3,4],"B":[11,12,13,14]}
var = pd.DataFrame(sr)
print(var)
OUTPUT
A B
0 1 11
1 2 12
2 3 13
3 4 14
Addition of A and B-
var["C"]=var["A"]+var["B"]
print(var)
OUTPUT
A B C
0 1 11 12
1 2 12 14
2 3 13 16
3 4 14 18
24.
➢ Arithmetic Operations in Pandas-
import pandas as pd
sr = {"A":[1,2,3,4],"B":[11,12,13,14]}
var1 = pd.DataFrame(sr)
var1["Python"]=var1["A"]<=2
print(var1)
OUTPUT
A B Python
0 1 11 True
1 2 12 True
2 3 13 False
3 4 14 False
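Example- The True/False column above can also be used to filter rows. A minimal sketch, reusing the same data (the second condition is made up for illustration):

```python
import pandas as pd

var1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [11, 12, 13, 14]})

# The comparison yields a Boolean Series, one True/False per row
mask = var1["A"] <= 2

# Indexing with the mask keeps only the rows where it is True
small = var1[mask]

# Conditions combine with & (and) and | (or); each needs its own parentheses
middle = var1[(var1["A"] > 1) & (var1["B"] < 14)]

print(small)
print(middle)
```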
25.
➢ Delete and Insert Data in Pandas-
import pandas as pd
sr = {"A":[1,2,3,4],"B":[11,12,13,14]}
var1 = pd.DataFrame(sr)
print(var1)
OUTPUT
A B
0 1 11
1 2 12
2 3 13
3 4 14
var1.insert(1,"Python",var1["A"])
print(var1)
OUTPUT
A Python B
0 1 1 11
1 2 2 12
2 3 3 13
3 4 4 14
Note- insert() arguments in order: position index, column name, data to be inserted
26.
➢ Delete and Insert Data in Pandas-
import pandas as pd
sr = {"A":[1,2,3,4],"B":[11,12,13,14]}
var1 = pd.DataFrame(sr)
print(var1)
OUTPUT
A B
0 1 11
1 2 12
2 3 13
3 4 14
var1["Python"]=var1["A"][:3]
print(var1)
OUTPUT
A B Python
0 1 11 1.0
1 2 12 2.0
2 3 13 3.0
3 4 14 NaN
Note- The slice [:3] selects the rows to copy (index 0 to 2); the remaining row gets NaN
27.
➢ Delete and Insert Data in Pandas-
import pandas as pd
sr = {"A":[1,2,3,4],"B":[11,12,13,14],"C":[21,22,23,24]}
var1 = pd.DataFrame(sr)
print(var1)
OUTPUT
A B C
0 1 11 21
1 2 12 22
2 3 13 23
3 4 14 24
var2 = var1.pop("B")
print(var2)
OUTPUT
0 11
1 12
2 13
3 14
Name: B, dtype: int64
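Example- `pop` removes the column from the DataFrame and returns it. As a sketch of another way to delete, `drop` returns a new DataFrame and leaves the original unchanged. The variable names below are chosen for illustration.

```python
import pandas as pd

var1 = pd.DataFrame({"A": [1, 2, 3, 4],
                     "B": [11, 12, 13, 14],
                     "C": [21, 22, 23, 24]})

# pop removes column "B" from var1 and returns it as a Series
popped = var1.pop("B")

# drop returns a new DataFrame without "C"; var1 itself is unchanged
without_c = var1.drop(columns="C")

# drop with index labels removes rows instead of columns
without_first_row = var1.drop(index=0)

print(list(var1.columns))
print(list(without_c.columns))
print(without_first_row)
```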
28.
➢ Creation of CSV files in Pandas-
Differences between CSV and XLS (Excel) file-
• CSV file is a plain text format in which values are
separated by commas (Comma Separated Values).
• XLS file format is an Excel Sheets binary file format
which holds information about all the worksheets in a
file, including both content and formatting.
29.
➢ Creation of CSV files in Pandas-
import pandas as pd
sr = {"A":[1,2,3,4],"B":[11,12,13,14],"C":[21,22,23,24]}
var1 = pd.DataFrame(sr)
print(var1)
var1.to_csv("python.csv")
Note- This creates a new CSV file in the current working directory
(usually the folder the script is run from) and overwrites any existing
file with the same name.
import pandas as pd
sr = {"A":[1,2,3,4],"B":[11,12,13,14],"C":[21,22,23,24]}
var1 = pd.DataFrame(sr)
print(var1)
var1.to_csv("python.csv", index=False)  # index=False leaves out the index column
30.
➢ Creation of CSV files in Pandas-
import pandas as pd
sr = {"A":[1,2,3,4],"B":[11,12,13,14],"C":[21,22,23,24]}
var1 = pd.DataFrame(sr)
print(var1)
var1.to_csv("python.csv", header=False)  # header=False leaves out the column names
OR
import pandas as pd
sr = {"A":[1,2,3,4],"B":[11,12,13,14],"C":[21,22,23,24]}
var1 = pd.DataFrame(sr)
print(var1)
var1.to_csv("python.csv", header=[11,12,13])  # writes the given names as column headers in place of A, B, C
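Example- A sketch of reading the CSV data back with `pd.read_csv`. To avoid creating files, `to_csv` is called without a path so that it returns the CSV text, and `io.StringIO` stands in for the file. With a real file, the file name is passed instead.

```python
import io
import pandas as pd

var1 = pd.DataFrame({"A": [1, 2, 3, 4],
                     "B": [11, 12, 13, 14],
                     "C": [21, 22, 23, 24]})

# With no path, to_csv returns the CSV text instead of writing a file
with_index = var1.to_csv()
without_index = var1.to_csv(index=False)

# Reading the index-free text back reproduces the original table
restored = pd.read_csv(io.StringIO(without_index))

# Reading the text that includes the index needs index_col=0,
# otherwise the index shows up as an extra unnamed column
restored_with_index = pd.read_csv(io.StringIO(with_index), index_col=0)

print(restored)
print(restored_with_index)
```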