DATA STRUCTURE IN PANDAS
A data structure is a way of arranging data so that it can be accessed quickly and so that we can
perform various operations on it, such as retrieval, deletion and modification.
Pandas deals with three data structures-
1. Series
2. Data Frame
3. Panel
Series- A Series is a one-dimensional array-like structure with homogeneous data, which can be used to
handle and manipulate data. What makes it special is its index attribute, which gives every value a label
and whose labels can be changed freely.
It has two parts-
1. Data part (An array of actual data)
2. Associated index with data (associated array of indexes or data labels)
e.g.-
Index Data
0 10
1 15
2 18
3 22
✓ We can say that a Series is a labelled one-dimensional array which can hold any type of data.
✓ The data of a Series is always mutable, meaning it can be changed.
✓ But the size of a Series is always immutable, meaning it cannot be changed.
✓ A Series may be considered as a data structure with two arrays, out of which one array works as the index
(labels) and the second array holds the original data.
✓ Row labels in a Series are called the index.
Syntax to create a Series:
<Series Object> = pandas.Series(data, index=idx)   (index is optional)
✓ Where data may be python sequence (Lists), ndarray, scalar value or a python dictionary.
How to create a Series from an ndarray
Program-
import pandas as pd
import numpy as np
arr = np.array([10, 15, 18, 22])   # here we create an array of 4 values
s = pd.Series(arr)
print(s)
Output-
0    10
1    15
2    18
3    22
(The default index 0, 1, 2, 3 is used because no index was supplied.)
Program-
import pandas as pd
import numpy as np
arr = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(arr, index=['first', 'second', 'third', 'fourth'])
print(s)
Output-
first     a
second    b
third     c
fourth    d
Creating a series from a Scalar value
To create a Series from a scalar value, an index must be provided. The scalar value is repeated to match
the length of the index.
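A minimal sketch (the scalar value and index labels below are illustrative, not from the original notes):
import pandas as pd
s = pd.Series(7, index=['a', 'b', 'c'])   # the scalar 7 is repeated for every label
print(s)
Output-
a    7
b    7
c    7
dtype: int64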
Creating a series from a Dictionary
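When the data is a dictionary, its keys become the index labels and its values become the data. A minimal
sketch (the dictionary contents are illustrative):
import pandas as pd
marks = {'English': 78, 'Maths': 91, 'Science': 85}
s = pd.Series(marks)      # keys -> index, values -> data
print(s)
Output-
English    78
Maths      91
Science    85
dtype: int64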
Selection in Series
A Series provides loc, iloc and [ ] to access its values.
1. Selection using the loc index label :-
Syntax:- series_name.loc[StartRange : StopRange]
To print values from index 0 to 2
To print values from index 3 to 4
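A minimal sketch (the Series values are assumed for illustration); note that loc slices by label and includes the stop label:
import pandas as pd
s = pd.Series([10, 15, 18, 22, 25])
print(s.loc[0:2])    # values at labels 0, 1 and 2 -> 10, 15, 18
print(s.loc[3:4])    # values at labels 3 and 4    -> 22, 25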
2. Selection using the iloc index (position) :-
Syntax:- series_name.iloc[StartRange : StopRange]
Example-
To print values from index 0 to 1.
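A minimal sketch with the same illustrative values; iloc slices by integer position and excludes the stop position:
import pandas as pd
s = pd.Series([10, 15, 18, 22, 25])
print(s.iloc[0:2])   # positions 0 and 1 -> 10, 15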
3. Selection using [ ] :
Syntax:- series_name[StartRange : StopRange] or series_name[index]
Example-
To print the value at index 3.
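A minimal sketch with the same illustrative values:
import pandas as pd
s = pd.Series([10, 15, 18, 22, 25])
print(s[3])          # single index -> 22
print(s[1:3])        # positional slice, stop excluded -> 15, 18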
Indexing in Series
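The original notes give no example under this heading; a minimal sketch of indexing with custom labels (values assumed):
import pandas as pd
s = pd.Series([10, 15, 18], index=['x', 'y', 'z'])
print(s['y'])        # access by index label -> 15
print(s.iloc[0])     # access by position    -> 10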
Slicing in Series
Slicing is a way to retrieve subsets of data from a pandas object. The slice syntax is -
SERIES_NAME[start : end : step]
Here start is the first position to include, end is the position to stop before (it is not included), and
step is the increment between the items that you would like. Example :-
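A minimal sketch (values assumed for illustration):
import pandas as pd
s = pd.Series([10, 15, 18, 22, 25, 30])
print(s[1:5:2])      # start=1, end=5, step=2 -> 15, 22
print(s[::-1])       # a negative step reverses the Series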
DATAFRAME
DATAFRAME-It is a two-dimensional object that is useful in representing
data in the form of rows and columns. It is similar to a spreadsheet or an
SQL table. This is the most commonly used pandas object. Once we store the
data into the Dataframe, we can perform various operations that are useful in
analyzing and understanding the data.
DATAFRAME STRUCTURE
      PLAYERNAME   IPLTEAM   BASEPRICE (IN CR)    <- COLUMNS
0     ROHIT        MI        13
1     VIRAT        RCB       17
2     HARDIK       MI        14
(The leftmost numbers are the INDEX; the cell values are the DATA.)
PROPERTIES OF DATAFRAME
1. A Dataframe has two axes (indices)-
➢ Row index (axis=0)
➢ Column index (axis=1)
2. It is similar to a spreadsheet, whose row labels are called the index and whose column labels are
called column names.
3. A Dataframe contains heterogeneous data.
4. A Dataframe's size is mutable.
5. A Dataframe's data is mutable.
A data frame can be created using any of the following-
1. Series
2. Lists
3. Dictionary
4. A numpy 2D array
How to create Dataframe From Series
Program-
import pandas as pd
s = pd.Series(['a', 'b', 'c', 'd'])
df = pd.DataFrame(s)
print(df)
Output-
   0
0  a
1  b
2  c
3  d
(The default column name is 0.)
DataFrame from Dictionary of Series
Example-
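A minimal sketch (the Series names and figures are illustrative assumptions): each Series becomes one column, and rows are aligned on the Series indexes.
import pandas as pd
population = pd.Series([385, 1247, 880], index=['Delhi', 'Mumbai', 'Kolkata'])
area = pd.Series([1484, 603, 206], index=['Delhi', 'Mumbai', 'Kolkata'])
df = pd.DataFrame({'Population': population, 'Area': area})
print(df)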
DataFrame from List of Dictionaries
Example-
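A minimal sketch (the records are illustrative): each dictionary becomes one row, and any missing key appears as NaN.
import pandas as pd
records = [{'empid': 101, 'ename': 'Sachin'},
           {'empid': 102, 'ename': 'Vinod'},
           {'empid': 103, 'ename': 'Lakhbir', 'dept': 'HR'}]
df = pd.DataFrame(records)
print(df)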
Select operation in data frame
To access the column data, we can mention the column name as a subscript,
e.g. df['empid']. This can also be done by using df.empid.
To access multiple columns we can write df[[col1, col2, ...]]
Example -
>>> df.empid   or   df['empid']
0    101
1    102
2    103
3    104
4    105
5    106
Name: empid, dtype: int64
>>> df[['empid', 'ename']]
   empid      ename
0    101     Sachin
1    102      Vinod
2    103    Lakhbir
3    104       Anil
4    105   Devinder
5    106  Uma Selvi
To Delete a Column in data frame
We can delete a column from a data frame by using any of the following -
1. del
2. pop()
3. drop()
To Delete a Column Using drop()
import pandas as pd
s = pd.Series([10, 20, 30, 40])
df = pd.DataFrame(s)
df.columns = ['List1']
df['List2'] = 40
df1 = df.drop('List2', axis=1)   # axis=1 means delete data column-wise
df2 = df.drop([2, 3], axis=0)    # axis=0 means delete data row-wise with the given index
print(df)
print("After deletion::")
print(df1)
print("After row deletion::")
print(df2)
Output-
List1 List2
0 10 40
1 20 40
2 30 40
3 40 40
After deletion::
List1
0 10
1 20
2 30
3 40
After row deletion::
List1
0 10
1 20
Accessing the data frame through the loc() and iloc() methods (indexing using labels)
Pandas provides the loc() and iloc() methods to access a subset of a data frame using rows/columns.
Accessing the data frame through loc()
It is used to access a group of rows and columns by label.
Syntax- df.loc[StartRow : EndRow, StartColumn : EndColumn]
Example-1:
To access the first row
To access the first 3 rows
Example-2:
To access a single column
To access multiple columns, namely TCS and WIPRO
Example-3:
To access the first row
To access the first 3 rows
(A sketch of these selections is shown below.)
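The worked screenshots are not reproduced in these notes; a minimal sketch assuming a small DataFrame of share prices with columns TCS and WIPRO (the column names come from the prompt above, the values are illustrative):
import pandas as pd
df = pd.DataFrame({'TCS': [3500, 3520, 3490, 3510],
                   'WIPRO': [410, 415, 408, 412]})
print(df.loc[0])                      # first row
print(df.loc[0:2])                    # first 3 rows (loc includes the stop label)
print(df.loc[:, 'TCS'])               # a single column
print(df.loc[:, ['TCS', 'WIPRO']])    # multiple columns, namely TCS and WIPRO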
Accessing the data frame through iloc()
It is used to access a group of rows and columns based on numeric (integer position) index values.
Syntax-
df.iloc[StartRowIndex : EndRowIndex, StartColumnIndex : EndColumnIndex]
Note- If we pass : in the row or column part, pandas returns the entire rows or columns respectively.
To access the first two rows and the second column
To access all rows and the first two columns
(A sketch of these selections is shown below.)
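A minimal sketch using the same illustrative DataFrame as above:
import pandas as pd
df = pd.DataFrame({'TCS': [3500, 3520, 3490, 3510],
                   'WIPRO': [410, 415, 408, 412]})
print(df.iloc[0:2, 1])     # first two rows, second column (stop position excluded)
print(df.iloc[:, 0:2])     # all rows, first two columns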
head() and tail()
The method head() returns the first 5 rows and the method tail() returns the last 5 rows.
To display the first 2 rows we can use head(2), to return the last 2 rows we can use tail(2), and to
return the rows with index 2 to 4 we can write df[2:5].
import pandas as pd
empdata = {'Doj': ['12-01-2012', '15-01-2012', '05-09-2007', '17-01-2012', '05-09-2007', '16-01-2012'],
           'empid': [101, 102, 103, 104, 105, 106],
           'ename': ['Sachin', 'Vinod', 'Lakhbir', 'Anil', 'Devinder', 'UmaSelvi']}
df = pd.DataFrame(empdata)
print(df)
print(df.head(2))
print(df.tail(2))
print(df[2:5])
Output-
Doj empid ename
0 12-01-2012 101 Sachin
1 15-01-2012 102 Vinod
2 05-09-2007 103 Lakhbir
3 17-01-2012 104 Anil
4 05-09-2007 105 Devinder
5 16-01-2012 106 UmaSelvi
Doj empid ename
0 12-01-2012 101 Sachin
1 15-01-2012 102 Vinod
head(2) displays first 2 rows
Doj empid ename
4 05-09-2007 105 Devinder
5 16-01-2012 106 UmaSelvi
tail(2) displays last 2 rows
Doj empid ename
2 05-09-2007 103 Lakhbir
3 17-01-2012 104 Anil
4 05-09-2007 105 Devinder
df[2:5] displays the rows with index 2 to 4
HANDLING MISSING DATA
Missing data can occur when no information is provided for one or more items or for a whole unit.
Missing data is also referred to as NA (Not Available) values in pandas. For example, in a survey some
users may choose not to share their income and others may choose not to share their address; in this way
parts of many datasets end up missing. Pandas provides the following functions for detecting and handling missing data:
1. isnull()
2. notnull()
3. dropna()
4. fillna()
5. replace()
6. interpolate()
1. isnull():
import pandas as pd
import numpy as np
dict = {'First Score':[100, 90, np.nan, 95],'Second Score': [30, 45, 56, np.nan],'Third Score':[np.nan,
40, 80, 98]}
df = pd.DataFrame(dict)
df.isnull()
Output:
First Score Second Score Third Score
0 False False True
1 False False False
2 True False False
3 False True False
2. notnull():
df.notnull()
Output:
First Score Second Score Third Score
0 True True False
1 True True True
2 False True True
3 True False True
3. dropna():
df.dropna()
Output:
   First Score  Second Score  Third Score
1 90.0 45.0 40.0
4. fillna():
df.fillna(0)
Output:
   First Score  Second Score  Third Score
0        100.0          30.0          0.0
1         90.0          45.0         40.0
2          0.0          56.0         80.0
3         95.0           0.0         98.0
Filling null values with the previous ones:
df.fillna(method ='pad')
Output:
   First Score  Second Score  Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 90.0 56.0 80.0
3 95.0 56.0 98.0
Filling null values with the next ones:
df.fillna(method ='bfill')
Output:
   First Score  Second Score  Third Score
0 100.0 30.0 40.0
1 90.0 45.0 40.0
2 95.0 56.0 80.0
3 95.0 NaN 98.0
5. replace():
df.replace(to_replace = np.nan, value = -99)
Output:
   First Score  Second Score  Third Score
0 100.0 30.0 -99.0
1 90.0 45.0 40.0
2 -99.0 56.0 80.0
3 95.0 -99.0 98.0
6. interpolate():
df.interpolate(method ='linear', limit_direction ='forward')
Output:
   First Score  Second Score  Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 92.5 56.0 80.0
3 95.0 56.0 98.0
TIMEDELTA IN PANDAS
A Timedelta represents a duration, i.e. the difference between two points in time. For example:
import pandas as pd
td = pd.Timedelta(133, unit='s')
print(td)
print(td.seconds)
Output:
0 days 00:02:13
133
VECTORIZATION CONCEPT IMPLEMENTATION USING PANDAS
Vectorization is used to speed up Python code without using explicit loops. Using such functions can help
in minimising the running time of code.
Classic loop-based methods are more time-consuming than the equivalent standard (vectorized) functions,
which can be verified by measuring their processing time; a timing comparison is sketched after the list below.
outer(a, b): Compute the outer product of two vectors.
multiply(a, b): Element-wise product of two arrays.
dot(a, b): Dot product of two arrays.
zeros((n, m)): Return a matrix of given shape and type, filled with zeros.
process_time(): Return the value (in fractional seconds) of the sum of the system and user
CPU time of the current process. It does not include time elapsed during sleep.
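A minimal sketch (the array size and data are illustrative assumptions) comparing a plain Python loop with the vectorized numpy dot(), timed with process_time():
import numpy as np
from time import process_time

n = 1_000_000                      # illustrative size
a = np.random.rand(n)
b = np.random.rand(n)

# Classic loop-based dot product
start = process_time()
total = 0.0
for i in range(n):
    total += a[i] * b[i]
loop_time = process_time() - start

# Vectorized dot product
start = process_time()
total_vec = np.dot(a, b)
vec_time = process_time() - start

print("Loop result      :", total, " time:", loop_time)
print("Vectorized result:", total_vec, " time:", vec_time)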
MULTI INDEXING CONCEPTS
Multi-index allows you to select more than one row and column in your index.
Example: Creating multi-index from arrays
import pandas as pd
arrays = ['Sohom','Suresh','kumkum','subrata']
age= [10, 11, 12, 13]
marks=[90,92,23,64]
multi_index = pd.MultiIndex.from_arrays([arrays,age,marks], names=('names', 'age','marks'))
print(multi_index)
Output:
MultiIndex([( 'Sohom', 10, 90),
( 'Suresh', 11, 92),
( 'kumkum', 12, 23),
('subrata', 13, 64)],
names=['names', 'age', 'marks'])
Example: Creating multi-index from DataFrame using Pandas.
import pandas as pd
dict = {'name': ["Saikat", "Shrestha", "Sandi", "Abinash"],
        'Jobs': ["Software Developer", "System Engineer", "Footballer", "Singer"],
        'Annual Salary(L.P.A)': [12.4, 5.6, 9.3, 10]}
df = pd.DataFrame(dict)
print(df)
Output:
       name                Jobs  Annual Salary(L.P.A)
0    Saikat  Software Developer                  12.4
1  Shrestha     System Engineer                   5.6
2     Sandi          Footballer                   9.3
3   Abinash              Singer                  10.0
pd.MultiIndex.from_frame(df)
Output:
MultiIndex([( 'Saikat', 'Software Developer', 12.4),
('Shrestha', 'System Engineer', 5.6),
( 'Sandi', 'Footballer', 9.3),
( 'Abinash', 'Singer', 10.0)],
names=['name', 'Jobs', 'Annual Salary(L.P.A)'])
Example: Using DataFrame.set_index([col1,col2,..])
import pandas as pd
data = {
    'series': ['Peaky blinders', 'Sherlock', 'The crown', 'Queens Gambit', 'Friends'],
    'Ratings': [4.5, 5, 3.9, 4.2, 5],
    'Date': [2013, 2010, 2016, 2020, 1994]
}
df = pd.DataFrame(data)
df.set_index(["series", "Ratings"], inplace=True, append=True, drop=False)
print(df)
Output: the DataFrame now has a three-level row index (the original integer index plus series and Ratings) and, because drop=False, it still keeps all three columns.
print(df.index)
Output:
MultiIndex([(0, 'Peaky blinders', 4.5),
(1, 'Sherlock', 5.0),
(2, 'The crown', 3.9),
(3, 'Queens Gambit', 4.2),
(4, 'Friends', 5.0)],
names=[None, 'series', 'Ratings'])
Reindexing in pandas
Reindexing is the process of conforming a DataFrame or Series to a new set of labels.
Key Points:
You can add, remove, or rearrange indices.
Missing values (NaN) are introduced if the new index does not match the existing data.
Reindexing is useful for aligning data to a specific structure.
Reindexing Example:
import pandas as pd
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}
df = pd.DataFrame(data, index=['a', 'b', 'c'])
# Reindex with a new set of indices
df_reindexed = df.reindex(['a', 'b', 'd', 'e'])
print(df_reindexed)
Output:
A B
a 1.0 4.0
b 2.0 5.0
d NaN NaN
e NaN NaN
Reindexing Columns:
df_reindexed_cols = df.reindex(columns=['A', 'B', 'C'])
print(df_reindexed_cols)
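The expected output would be (column C is new, so it is filled with NaN):
   A  B   C
a  1  4 NaN
b  2  5 NaN
c  3  6 NaN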
Hierarchical Indexing (MultiIndex)
Hierarchical indexing, or MultiIndex, allows you to create multi-level indices on rows or columns. This is
useful for handling higher-dimensional data in a 2D DataFrame.
Key Points:
MultiIndex can have multiple levels.
It helps in grouping and aggregating data efficiently.
Accessing data becomes more flexible.
Creating a MultiIndex:
import pandas as pd

arrays = [
    ['A', 'A', 'B', 'B'],
    ['one', 'two', 'one', 'two']
]
index = pd.MultiIndex.from_arrays(arrays, names=('Level 1', 'Level 2'))
df_multi = pd.DataFrame({
'Value': [10, 20, 30, 40]
}, index=index)
print(df_multi)
Output:
Value
Level 1 Level 2
A one 10
two 20
B one 30
two 40
Accessing MultiIndex Data:
# Access data at a specific level
print(df_multi.loc['A'])
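Expected output (the selected outer label 'A' is dropped, leaving Level 2 as the index):
         Value
Level 2
one         10
two         20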
Sorting by Levels:
df_sorted = df_multi.sort_index(level=0)
print(df_sorted)
Resetting Index
To convert a MultiIndex DataFrame back to a regular one, use reset_index().
df_reset = df_multi.reset_index()
print(df_reset)
Stacking and Unstacking
Stacking and unstacking are operations used to reshape hierarchical data.
Stack: Converts columns into rows.
Unstack: Converts rows into columns.
Example:
# Stacking the columns into a single level of row index
df_stacked = df_multi.stack()
print(df_stacked)
# Unstacking rows into columns
df_unstacked = df_multi.unstack()
print(df_unstacked)
1. Data Alignment in pandas
Data alignment in pandas refers to the automatic matching of rows and columns between different
DataFrames and Series based on their labels (index and column names).
It ensures that operations between multiple data structures are aligned correctly.
When performing operations like addition or subtraction on two DataFrames, pandas aligns the
data by row and column labels.
Missing labels result in NaN values in the output.
Example
import pandas as pd
df1 = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({
'B': [7, 8, 9],
'C': [10, 11, 12]
}, index=['b', 'c', 'd'])
# Automatic alignment during addition
result = df1 + df2
print(result)
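The expected result would be:
    A     B   C
a NaN   NaN NaN
b NaN  12.0 NaN
c NaN  14.0 NaN
d NaN   NaN NaN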
Explanation:
Columns and rows are aligned based on their labels.
If a label is missing in either DataFrame, it results in NaN for that position.
2. Randomization (Shuffling Rows)
Randomization refers to changing the order of rows in a DataFrame.
This can be useful for tasks like shuffling a dataset for training machine learning models.
Method:
Use sample() with frac=1 to shuffle all rows.
Example:
df_shuffled = df1.sample(frac=1).reset_index(drop=True)
print(df_shuffled)
Explanation:
frac=1 indicates that all rows should be included in the sample.
reset_index(drop=True) resets the index to avoid retaining the original index values.
3. Sorting in pandas
Sorting allows you to arrange data based on row indices or specific column values.
(a) Sorting by Index (sort_index)
Used to sort rows or columns by their labels.
Example:
df_sorted_index = df1.sort_index(ascending=False)
print(df_sorted_index)
(b) Sorting by Values (sort_values)
Used to sort rows based on the values in one or more columns.
Example:
df_sorted_values = df1.sort_values(by='A', ascending=False)
print(df_sorted_values)
Combining Operations
You can combine multiple operations like randomization and sorting for more complex workflows.
Example:
# Shuffle rows randomly and then sort by column 'B'
df_final = df1.sample(frac=1).sort_values(by='B')
print(df_final)
Data Acquisition
What is Data Acquisition?
Data acquisition refers to the process of collecting, measuring, and storing data
from different sources to be used in analysis and modeling.
Why is Data Acquisition Important?
Ensures that the data collected is relevant and accurate.
The quality of data significantly impacts the results of data analysis and
predictive models.
Helps in building robust solutions in areas like business intelligence, machine
learning, and forecasting.
Data Gathering from Different Sources
In data science, gathering data from multiple sources is essential for building a
comprehensive dataset. The key sources include:
1. Internal Data – Data from within the organization (e.g., sales, customer
interactions, inventory).
2. External Data – Data from outside the organization, such as publicly available
datasets, third-party services, or web-based content.
Common External Sources:
Web APIs
Open Data Sources
Web Scraping
1. Web APIs
Web APIs (Application Programming Interfaces)
A Web API is a service that allows you to access data hosted on web servers. APIs offer a
structured way to request data from a specific service or application.
How Web APIs Work:
APIs expose endpoints that you can query using HTTP methods like GET, POST, PUT, or
DELETE.
Responses are typically in JSON or XML format, making it easy to parse and process.
Examples of APIs:
Weather Data: OpenWeather API provides real-time weather data.
Social Media: Twitter and Facebook APIs offer social media insights.
Financial Data: Alpha Vantage and Yahoo Finance provide stock market data.
Example in Python (using requests library):
import requests
url = "https://api.openweathermap.org/data/2.5/weather"
params = {
"q": "London",
"appid": "YOUR_API_KEY"
}
response = requests.get(url, params=params)
data = response.json()
print(data)
2. Open Data Sources
Open data refers to datasets made available by public organizations, governments,
or research institutions. These datasets are often free and can be used for analysis
and research.
Popular Open Data Platforms:
Kaggle: Contains a variety of datasets for machine learning and research.
Data.gov: U.S. government’s open data portal with datasets on various
topics.
World Bank Open Data: Global economic indicators.
UCI Machine Learning Repository: A collection of datasets for machine
learning research.
Web scraping
It involves extracting data directly from websites. It is useful when data is not
available through APIs or structured sources.
How Web Scraping Works:
Request the webpage’s content using tools like requests.
Parse the content using libraries like BeautifulSoup.
Extract specific elements (e.g., product prices, article titles).
Common Tools for Web Scraping:
BeautifulSoup: For parsing and extracting data from HTML and XML.
Scrapy: A powerful framework for building web scrapers.
Selenium: Used for scraping dynamic content by simulating a browser.
Example using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
# Extract title of the page
print(soup.title.text)
Applications:
Extracting product prices from e-commerce websites.
Gathering real estate data for price prediction.
Best Practices in Data Acquisition:
1. Respect Terms of Service: Ensure the data source allows automated access.
2. Handle Errors and Exceptions: Implement error handling when calling APIs or
scraping data.
3. Data Cleaning and Preprocessing: The acquired data may need cleaning
before analysis.