
Dataframes UNIT 1 PART 2

The document provides an overview of data structures in Pandas, specifically focusing on Series, DataFrame, and Panel. It explains the characteristics, creation methods, and operations such as selection, indexing, and handling missing data for these structures. Additionally, it covers advanced concepts like vectorization and multi-indexing to enhance data manipulation efficiency.

Uploaded by

Rama Sugavanam
Copyright
© © All Rights Reserved

DATA STRUCTURE IN PANDAS

A data structure is a way of arranging data so that it can be accessed quickly and so that operations
such as retrieval, deletion and modification can be performed on it.
Pandas has traditionally offered 3 data structures-
1. Series
2. Data Frame
3. Panel (removed from current pandas releases; Series and Data Frame are the ones in everyday use)

Series- A Series is a one-dimensional, array-like structure with homogeneous data, which can be used to
handle and manipulate data. What makes it special is its index, which attaches a label to every value
and can be replaced with custom labels.
It has two parts-
1. Data part (an array of actual data)
2. Associated index (an array of indexes or data labels)

e.g.-
Index Data
0 10
1 15
2 18
3 22

✓ We can say that a Series is a labeled one-dimensional array which can hold data of any type.
✓ The data of a Series is always mutable, meaning its values can be changed.
✓ But the size of a Series is immutable, meaning its length cannot be changed.
✓ A Series may be considered a data structure with two arrays, of which one works as the index
(labels) and the other holds the actual data.
✓ Row labels in a Series are called the index.

Syntax to create a Series:

<Series Object> = pandas.Series(data, index=idx)

✓ Here data may be a Python sequence (list), an ndarray, a scalar value or a Python dictionary. The index argument is optional.

How to create a Series with an ndarray

Program-
import pandas as pd
import numpy as np
arr = np.array([10, 15, 18, 22])
s = pd.Series(arr)
print(s)

Output-
0    10
1    15
2    18
3    22

Here we create an array of 4 values. The left column is the default index and the right column is the data.

Program-
import pandas as pd
import numpy as np
arr = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(arr, index=['first', 'second', 'third', 'fourth'])
print(s)

Output-
first     a
second    b
third     c
fourth    d

Creating a series from Scalar value


To create a series from scalar value, an index must be provided. The scalar value will be repeated as per
the length of index.
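
A minimal sketch of the scalar case, assuming an arbitrary value 5 and index labels 'a' to 'd' chosen only for illustration:

```python
import pandas as pd

# The scalar 5 is repeated once for every label in the index.
s = pd.Series(5, index=['a', 'b', 'c', 'd'])
print(s)
```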
Creating a series from a Dictionary
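
When a dictionary is passed, its keys become the index labels and its values become the data. A small sketch, with made-up subject names and marks:

```python
import pandas as pd

marks = {'maths': 90, 'physics': 85, 'chemistry': 78}  # hypothetical data
s = pd.Series(marks)   # keys -> index, values -> data
print(s)
```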

Selection in Series
A Series provides loc, iloc and [] to access its elements.

1. Selection using loc (label-based) :-

Syntax:- series_name.loc[StartRange : StopRange]
To Print Values from Index 0 to 2

To Print Values from Index 3 to 4
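
The two loc selections above might look like the following sketch, which assumes a hypothetical five-element Series with the default integer index. Note that loc slices by label and includes the stop label.

```python
import pandas as pd

s = pd.Series([10, 15, 18, 22, 27])  # hypothetical data, default index 0..4

first_three = s.loc[0:2]   # labels 0, 1 and 2 (stop label is included)
last_two = s.loc[3:4]      # labels 3 and 4
print(first_three)
print(last_two)
```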

2. Selection using iloc (position-based) :-

Syntax:- series_name.iloc[StartRange : StopRange]

Example-
To Print Values from Index 0 to 1.
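
A sketch of this selection, again on a hypothetical Series. iloc slices by integer position and, like ordinary Python slicing, excludes the stop position, so positions 0 to 1 are written as 0:2.

```python
import pandas as pd

s = pd.Series([10, 15, 18, 22, 27])  # hypothetical data

first_two = s.iloc[0:2]   # positions 0 and 1; position 2 is excluded
print(first_two)
```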

3. Selection using [] :

Syntax:- series_name[StartRange : StopRange] or
series_name[index]
Example-

To Print Values at Index 3.
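
A minimal sketch, assuming the same hypothetical Series with the default integer index, where the label 3 and the position 3 coincide:

```python
import pandas as pd

s = pd.Series([10, 15, 18, 22, 27])  # hypothetical data, default index 0..4

value = s[3]   # the element whose index label is 3
print(value)
```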


Indexing in Series

Slicing in Series

Slicing is a way to retrieve subsets of data from a pandas object. The slice syntax is –

SERIES_NAME[start : end : step]

Here start is the position of the first item, end is the position at which the slice stops (the item at
end itself is not included), and step is the increment between successive items.
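
The slice syntax can be sketched as follows. The Series and its string labels are hypothetical; string labels are used so that the integer slices are unambiguously positional.

```python
import pandas as pd

s = pd.Series([10, 15, 18, 22, 27], index=['a', 'b', 'c', 'd', 'e'])  # hypothetical data

middle = s[1:4]          # positions 1, 2 and 3
every_other = s[0:5:2]   # positions 0, 2 and 4
print(middle)
print(every_other)
```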
DATAFRAME
A DataFrame is a two-dimensional object that represents data in the form of rows and columns. It is
similar to a spreadsheet or an SQL table, and it is the most commonly used pandas object. Once data is
stored in a DataFrame, we can perform various operations that are useful in analyzing and
understanding it.
DATAFRAME STRUCTURE
     PLAYERNAME   IPLTEAM   BASEPRICEINCR     <- COLUMNS
0    ROHIT        MI        13
1    VIRAT        RCB       17
2    HARDIK       MI        14

The leftmost numbers (0, 1, 2) are the INDEX; the remaining cells are the DATA.
PROPERTIES OF DATAFRAME

1. A DataFrame has two axes (indices)-
➢ Row index (axis=0)
➢ Column index (axis=1)
2. It is similar to a spreadsheet, whose row index is called the index and whose column index is called
the column name.
3. A DataFrame can contain heterogeneous data.
4. The size of a DataFrame is mutable.
5. The data of a DataFrame is mutable.

A data frame can be created using any of the following-


1. Series
2. Lists
3. Dictionary
4. A numpy 2D array
How to create a DataFrame from a Series
Program-
import pandas as pd
s = pd.Series(['a', 'b', 'c', 'd'])
df = pd.DataFrame(s)
print(df)

Output-
   0
0  a
1  b
2  c
3  d

The default column name is 0.

DataFrame from Dictionary of Series


Example-
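
A possible example, using two hypothetical Series. The dictionary keys become column names, and the row index is the union of the Series indexes, with NaN where a Series has no value for a label.

```python
import pandas as pd

# Hypothetical data: 'one' has no value for label 'd'.
data = {
    'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two': pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd']),
}
df = pd.DataFrame(data)
print(df)
```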
DataFrame from List of Dictionaries
Example-
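
A possible example with made-up records. Each dictionary becomes one row, the keys become column names, and a key missing from a record gives NaN in that row.

```python
import pandas as pd

# Hypothetical records: the first one has no key 'c'.
records = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(records)
print(df)
```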

Select operation in data frame


To access the data of a column, we can give the column name as a subscript,
e.g. df['empid']. This can also be done by using df.empid.
To access multiple columns we can write df[['col1', 'col2', ...]].
Example -
>> df.empid or df['empid']
0    101
1    102
2    103
3    104
4    105
5    106
Name: empid, dtype: int64

>> df[['empid', 'ename']]
   empid      ename
0    101     Sachin
1    102      Vinod
2    103    Lakhbir
3    104       Anil
4    105   Devinder
5    106  Uma Selvi

To Delete a Column in data frame


We can delete a column from a data frame by using any of the following –
1. del
2. pop()
3. drop()
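
A short sketch of the first two options, on a hypothetical three-column data frame. Both del and pop() change the data frame in place; pop() also returns the removed column.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'List1': [10, 20], 'List2': [40, 40], 'List3': [1, 2]})

del df['List3']             # removes the column in place, returns nothing
removed = df.pop('List2')   # removes the column in place and returns it as a Series
print(df)
print(removed)
```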

To Delete a Column Using drop()


import pandas as pd
s = pd.Series([10, 20, 30, 40])
df = pd.DataFrame(s)
df.columns = ['List1']
df['List2'] = 40
# axis=1 means delete data column-wise
df1 = df.drop('List2', axis=1)
# axis=0 means delete data row-wise, for the given index labels
df2 = df.drop([2, 3], axis=0)
print(df)
print(" After deletion::")
print(df1)
print(" After row deletion::")
print(df2)

Output-
   List1  List2
0     10     40
1     20     40
2     30     40
3     40     40
 After deletion::
   List1
0     10
1     20
2     30
3     40
 After row deletion::
   List1  List2
0     10     40
1     20     40
Accessing the data frame through loc and iloc (indexing using labels)
Pandas provides the loc and iloc indexers to access a subset of a data frame by row/column.
Both are used with square brackets rather than parentheses.
Accessing the data frame through loc
It is used to access a group of rows and columns by label.
Syntax- df.loc[StartRow : EndRow, StartColumn : EndColumn]
Example-1:
To access first row
To access first 3 rows

Example-2:
To access single column
To access multiple columns, namely TCS and WIPRO

Example-3:
To access first row
To access first 3 rows
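
The selections listed above could be written as in the sketch below. Only the column names TCS and WIPRO come from the text; the INFY column, the quarter labels and all figures are hypothetical.

```python
import pandas as pd

# Hypothetical figures and row labels
df = pd.DataFrame(
    {'TCS': [2500, 2600, 2450, 2700],
     'WIPRO': [2800, 2750, 2900, 3000],
     'INFY': [1500, 1550, 1600, 1580]},
    index=['Qtr1', 'Qtr2', 'Qtr3', 'Qtr4'],
)

first_row = df.loc['Qtr1']                 # one row, returned as a Series
first_three = df.loc['Qtr1':'Qtr3']        # label slice, end label included
one_column = df.loc[:, 'TCS']              # all rows, single column
two_columns = df.loc[:, ['TCS', 'WIPRO']]  # all rows, selected columns
print(two_columns)
```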

Accessing the data frame through iloc

It is used to access a group of rows and columns based on numeric index position.

Syntax-
df.iloc[StartRowIndex : EndRowIndex, StartColumnIndex : EndColumnIndex]

Note - If we pass : in the row or column part, pandas returns all rows or all columns respectively.
To access the first two rows and the second column

To access all rows and the first two columns
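
A sketch of these two selections on a hypothetical data frame (column names and figures are made up for illustration). As with Python slices, the end position is excluded.

```python
import pandas as pd

# Hypothetical figures
df = pd.DataFrame(
    {'TCS': [2500, 2600, 2450, 2700],
     'WIPRO': [2800, 2750, 2900, 3000],
     'INFY': [1500, 1550, 1600, 1580]}
)

two_rows_second_col = df.iloc[0:2, 1]   # rows at positions 0-1, column at position 1
all_rows_two_cols = df.iloc[:, 0:2]     # every row, columns at positions 0-1
print(two_rows_second_col)
print(all_rows_two_cols)
```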
head() and tail()
The method head() returns the first 5 rows and the method tail() returns the last 5 rows.
To display the first 2 rows we can use head(2), to return the last 2 rows we can use tail(2),
and to return the 3rd to 5th rows we can write df[2:5].
import pandas as pd
empdata = {
    'Doj': ['12-01-2012', '15-01-2012', '05-09-2007', '17-01-2012', '05-09-2007', '16-01-2012'],
    'empid': [101, 102, 103, 104, 105, 106],
    'ename': ['Sachin', 'Vinod', 'Lakhbir', 'Anil', 'Devinder', 'UmaSelvi']
}
df = pd.DataFrame(empdata)
print(df)
print(df.head(2))
print(df.tail(2))
print(df[2:5])
Output-
          Doj  empid     ename
0  12-01-2012    101    Sachin
1  15-01-2012    102     Vinod
2  05-09-2007    103   Lakhbir
3  17-01-2012    104      Anil
4  05-09-2007    105  Devinder
5  16-01-2012    106  UmaSelvi

          Doj  empid   ename
0  12-01-2012    101  Sachin
1  15-01-2012    102   Vinod
head(2) displays the first 2 rows

          Doj  empid     ename
4  05-09-2007    105  Devinder
5  16-01-2012    106  UmaSelvi
tail(2) displays the last 2 rows

          Doj  empid     ename
2  05-09-2007    103   Lakhbir
3  17-01-2012    104      Anil
4  05-09-2007    105  Devinder
df[2:5] displays the 3rd to 5th rows (positions 2 to 4)


HANDLING MISSING DATA
Missing data occurs when no information is provided for one or more items or for a whole unit.
Missing values are also referred to as NA (Not Available) values in pandas. For example, some users
being surveyed may choose not to share their income and others may choose not to share their
address; in this way values go missing from many datasets.

Pandas provides the following functions for handling missing data:

1. isnull()
2. notnull()
3. dropna()
4. fillna()
5. replace()
6. interpolate()

1. isnull():

import pandas as pd
import numpy as np

# named scores rather than dict, so the built-in dict type is not shadowed
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

df = pd.DataFrame(scores)

df.isnull()

Output:

First Score Second Score Third Score

0 False False True

1 False False False

2 True False False

3 False True False

2. notnull():

df.notnull()

Output:

First Score Second Score Third Score

0 True True False

1 True True True


2 False True True

3 True False True

3. dropna():

df.dropna()

Output:

   First Score  Second Score  Third Score
1         90.0          45.0         40.0

4. fillna():

df.fillna(0)

Output:

   First Score  Second Score  Third Score
0        100.0          30.0          0.0
1         90.0          45.0         40.0
2          0.0          56.0         80.0
3         95.0           0.0         98.0

Filling null values with the previous ones (recent pandas versions deprecate the method argument of fillna(); df.ffill() is the equivalent call):

df.fillna(method='pad')

Output:

   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2         90.0          56.0         80.0
3         95.0          56.0         98.0

Filling null values with the next ones (df.bfill() is the equivalent call in recent pandas versions):

df.fillna(method='bfill')

Output:

   First Score  Second Score  Third Score
0        100.0          30.0         40.0
1         90.0          45.0         40.0
2         95.0          56.0         80.0
3         95.0           NaN         98.0


5. replace():

df.replace(to_replace=np.nan, value=-99)

Output:

   First Score  Second Score  Third Score
0        100.0          30.0        -99.0
1         90.0          45.0         40.0
2        -99.0          56.0         80.0
3         95.0         -99.0         98.0

6. interpolate():

df.interpolate(method='linear', limit_direction='forward')

Output:

   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2         92.5          56.0         80.0
3         95.0          56.0         98.0

import pandas as pd

import datetime

td = pd.Timedelta(133, unit='s')

print(td)

print(td.seconds)

Output:

0 days 00:02:13

133

VECTORIZATION CONCEPT IMPLEMENTATION USING PANDAS

Vectorization speeds up Python code by replacing explicit loops with operations that act on whole
arrays at once. Using such functions can greatly reduce the running time of the code.

That classic loop-based methods are more time-consuming than the standard vectorized functions can be
shown by measuring the processing time of each. Commonly used NumPy functions, and the timer used
to compare them, are:

• outer(a, b): Compute the outer product of two vectors.
• multiply(a, b): Element-wise product of two arrays.
• dot(a, b): Dot product of two arrays.
• zeros((n, m)): Return an array of the given shape and type, filled with zeros.
• process_time(): From the time module. Returns the value (in fractional seconds) of the sum of the
system and user CPU time of the current process. It does not include time elapsed during sleep.
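
The comparison described above can be sketched as follows. The array size and variable names are arbitrary choices for illustration, and the measured times will differ from machine to machine.

```python
import time
import numpy as np

n = 100_000  # arbitrary size chosen for illustration
a = np.arange(n, dtype=np.float64)
b = np.arange(n, dtype=np.float64)

# Classic approach: accumulate the dot product in a Python loop.
start = time.process_time()
loop_result = 0.0
for i in range(n):
    loop_result += a[i] * b[i]
loop_time = time.process_time() - start

# Vectorized approach: a single call to np.dot.
start = time.process_time()
vec_result = np.dot(a, b)
vec_time = time.process_time() - start

print("loop:", loop_time, "seconds")
print("dot: ", vec_time, "seconds")
```

Both approaches compute the same number; only the time taken differs, and on typical hardware the vectorized call is much faster.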

MULTI INDEXING CONCEPTS


A multi-index lets an index carry more than one level of labels, so that each row or column is identified by a combination of keys.

Example: Creating multi-index from arrays

import pandas as pd

arrays = ['Sohom','Suresh','kumkum','subrata']

age= [10, 11, 12, 13]

marks=[90,92,23,64]

multi_index = pd.MultiIndex.from_arrays([arrays,age,marks], names=('names', 'age','marks'))

print(multi_index)

Output:

MultiIndex([( 'Sohom', 10, 90),

( 'Suresh', 11, 92),

( 'kumkum', 12, 23),

('subrata', 13, 64)],

names=['names', 'age', 'marks'])

• Example: Creating multi-index from DataFrame using Pandas.

import pandas as pd

data = {'name': ["Saikat", "Shrestha", "Sandi", "Abinash"],
        'Jobs': ["Software Developer", "System Engineer", "Footballer", "Singer"],
        'Annual Salary(L.P.A)': [12.4, 5.6, 9.3, 10]}

df = pd.DataFrame(data)

print(df)

Output:

       name                Jobs  Annual Salary(L.P.A)
0    Saikat  Software Developer                  12.4
1  Shrestha     System Engineer                   5.6
2     Sandi          Footballer                   9.3
3   Abinash              Singer                  10.0

pd.MultiIndex.from_frame(df)

Output:

MultiIndex([( 'Saikat', 'Software Developer', 12.4),

('Shrestha', 'System Engineer', 5.6),

( 'Sandi', 'Footballer', 9.3),


( 'Abinash', 'Singer', 10.0)],

names=['name', 'Jobs', 'Annual Salary(L.P.A)'])

• Example: Using DataFrame.set_index([col1, col2, ...])

import pandas as pd

data = {
    'series': ['Peaky blinders', 'Sherlock', 'The crown', 'Queens Gambit', 'Friends'],
    'Ratings': [4.5, 5, 3.9, 4.2, 5],
    'Date': [2013, 2010, 2016, 2020, 1994]
}

df = pd.DataFrame(data)

df.set_index(["series", "Ratings"], inplace=True, append=True, drop=False)

print(df)

Output:

print(df.index)

Output:

MultiIndex([(0, 'Peaky blinders', 4.5),

(1, 'Sherlock', 5.0),

(2, 'The crown', 3.9),

(3, 'Queens Gambit', 4.2),

(4, 'Friends', 5.0)],

names=[None, 'series', 'Ratings'])

Reindexing in pandas

Reindexing is the process of conforming a DataFrame or Series to a new set of labels.

Key Points:

• You can add, remove, or rearrange indices.

• Missing values (NaN) are introduced for labels in the new index that are absent from the existing data.

• Reindexing is useful for aligning data to a specific structure.

Reindexing Example:

import pandas as pd

data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}

df = pd.DataFrame(data, index=['a', 'b', 'c'])

# Reindex with a new set of indices

df_reindexed = df.reindex(['a', 'b', 'd', 'e'])

print(df_reindexed)

Output:

A B

a 1.0 4.0

b 2.0 5.0

d NaN NaN

e NaN NaN

Reindexing Columns:

df_reindexed_cols = df.reindex(columns=['A', 'B', 'C'])

print(df_reindexed_cols)

Hierarchical Indexing (MultiIndex)

Hierarchical indexing, or MultiIndex, allows you to create multi-level indices on rows or columns. This is
useful for handling higher-dimensional data in a 2D DataFrame.

Key Points:

• MultiIndex can have multiple levels.

• It helps in grouping and aggregating data efficiently.

• Accessing data becomes more flexible.

Creating a MultiIndex:

import pandas as pd

arrays = [
    ['A', 'A', 'B', 'B'],
    ['one', 'two', 'one', 'two']
]

index = pd.MultiIndex.from_arrays(arrays, names=('Level 1', 'Level 2'))

df_multi = pd.DataFrame({

'Value': [10, 20, 30, 40]

}, index=index)

print(df_multi)

Output:

Value

Level 1 Level 2

A one 10

two 20

B one 30

two 40

Accessing MultiIndex Data:

# Access data at a specific level

print(df_multi.loc['A'])

Sorting by Levels:


df_sorted = df_multi.sort_index(level=0)

print(df_sorted)

Resetting Index

To convert a MultiIndex DataFrame back to a regular one, use reset_index().

df_reset = df_multi.reset_index()

print(df_reset)
Stacking and Unstacking

Stacking and unstacking are operations used to reshape hierarchical data.

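• Stack: Converts columns into rows (column labels become the innermost level of the row index).

• Unstack: Converts rows into columns (the innermost level of the row index becomes column labels).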

Example:

# Stacking the columns into a single level of row index

df_stacked = df_multi.stack()

# Unstacking rows into columns

df_unstacked = df_multi.unstack()
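
To make the effect visible, the sketch below rebuilds the df_multi frame from the MultiIndex example so that it runs on its own, unstacks it, and then stacks the result back. Some pandas versions may emit a FutureWarning about a new stack implementation; the result here is the same either way.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], ['one', 'two', 'one', 'two']],
    names=('Level 1', 'Level 2'),
)
df_multi = pd.DataFrame({'Value': [10, 20, 30, 40]}, index=index)

df_unstacked = df_multi.unstack()   # 'Level 2' moves from the rows to the columns
df_stacked = df_unstacked.stack()   # and back into the row index
print(df_unstacked)
print(df_stacked)
```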

1. Data Alignment in pandas

• Data alignment in pandas refers to the automatic matching of rows and columns between different
DataFrames and Series based on their labels (index and column names).

• It ensures that operations between multiple data structures are aligned correctly.

• When performing operations like addition or subtraction on two DataFrames, pandas aligns the
data by row and column labels.

• Missing labels result in NaN values in the output.

Example

import pandas as pd

df1 = pd.DataFrame({

'A': [1, 2, 3],

'B': [4, 5, 6]

}, index=['a', 'b', 'c'])

df2 = pd.DataFrame({

'B': [7, 8, 9],

'C': [10, 11, 12]

}, index=['b', 'c', 'd'])

# Automatic alignment during addition


result = df1 + df2

print(result)

Explanation:

• Columns and rows are aligned based on their labels.

• If a label is missing in either DataFrame, it results in NaN for that position.
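
The sketch below repeats the two frames from the example so that it runs on its own. It also shows DataFrame.add with fill_value=0, which is one way (not the only one) to treat a value missing on one side as 0 instead of producing NaN.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'B': [7, 8, 9], 'C': [10, 11, 12]}, index=['b', 'c', 'd'])

result = df1 + df2                    # only column B at rows b and c exists in both
filled = df1.add(df2, fill_value=0)   # a value missing on one side counts as 0
print(result)
print(filled)
```

Positions missing from both frames (for example row d, column A) remain NaN even with fill_value.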

2. Randomization (Shuffling Rows)

• Randomization refers to changing the order of rows in a DataFrame.

• This can be useful for tasks like shuffling a dataset for training machine learning models.

Method:

• Use sample() with frac=1 to shuffle all rows.

Example:

df_shuffled = df1.sample(frac=1).reset_index(drop=True)

print(df_shuffled)

Explanation:

• frac=1 indicates that all rows should be included in the sample.

• reset_index(drop=True) resets the index to avoid retaining the original index values.

3. Sorting in pandas

Sorting allows you to arrange data based on row indices or specific column values.

(a) Sorting by Index (sort_index)

• Used to sort rows or columns by their labels.

Example:

df_sorted_index = df1.sort_index(ascending=False)

print(df_sorted_index)

(b) Sorting by Values (sort_values)

• Used to sort rows based on the values in one or more columns.

Example:

df_sorted_values = df1.sort_values(by='A', ascending=False)

print(df_sorted_values)
Combining Operations

You can combine multiple operations like randomization and sorting for more complex workflows.

Example:

# Shuffle rows randomly and then sort by column 'B'

df_final = df1.sample(frac=1).sort_values(by='B')

print(df_final)

Data Acquisition
What is Data Acquisition?
Data acquisition refers to the process of collecting, measuring, and storing data
from different sources to be used in analysis and modeling.
Why is Data Acquisition Important?
• Ensures that the data collected is relevant and accurate.
• The quality of data significantly impacts the results of data analysis and
predictive models.
• Helps in building robust solutions in areas like business intelligence, machine
learning, and forecasting.
Data Gathering from Different Sources
In data science, gathering data from multiple sources is essential for building a
comprehensive dataset. The key sources include:
1. Internal Data – Data from within the organization (e.g., sales, customer
interactions, inventory).
2. External Data – Data from outside the organization, such as publicly available
datasets, third-party services, or web-based content.
Common External Sources:
• Web APIs
• Open Data Sources
• Web Scraping

1. Web APIs (Application Programming Interfaces)

A Web API is a service that allows you to access data hosted on web servers. APIs offer a
structured way to request data from a specific service or application.

How Web APIs Work:

• APIs expose endpoints that you can query using HTTP methods like GET, POST, PUT, or
DELETE.
• Responses are typically in JSON or XML format, making them easy to parse and process.
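
As an illustration of how a JSON response can be parsed and processed, the sketch below loads a made-up response body into a DataFrame. The payload and its field names are hypothetical and do not reproduce the format of any particular API.

```python
import json
import pandas as pd

# Hypothetical response body, shaped loosely like a weather API reply.
payload = '''
[
  {"city": "London", "main": {"temp": 285.3, "humidity": 72}},
  {"city": "Paris",  "main": {"temp": 288.1, "humidity": 65}}
]
'''

records = json.loads(payload)      # JSON text -> list of Python dictionaries
df = pd.json_normalize(records)    # nested keys become columns such as 'main.temp'
print(df)
```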

Examples of APIs:

• Weather Data: OpenWeather API provides real-time weather data.
• Social Media: Twitter and Facebook APIs offer social media insights.
• Financial Data: Alpha Vantage and Yahoo Finance provide stock market data.

Example in Python (using requests library):


import requests
url = "https://api.openweathermap.org/data/2.5/weather"
params = {
    "q": "London",
    "appid": "YOUR_API_KEY"  # replace with your own API key
}
response = requests.get(url, params=params, timeout=10)
response.raise_for_status()  # raise an error for 4xx/5xx responses
data = response.json()
print(data)

2. Open Data Sources


Open data refers to datasets made available by public organizations, governments,
or research institutions. These datasets are often free and can be used for analysis
and research.
Popular Open Data Platforms:
• Kaggle: Contains a variety of datasets for machine learning and research.
• Data.gov: U.S. government's open data portal with datasets on various
topics.
• World Bank Open Data: Global economic indicators.
• UCI Machine Learning Repository: A collection of datasets for machine
learning research.
3. Web Scraping
Web scraping involves extracting data directly from websites. It is useful when data is not
available through APIs or structured sources.
How Web Scraping Works:
• Request the webpage's content using tools like requests.
• Parse the content using libraries like BeautifulSoup.
• Extract specific elements (e.g., product prices, article titles).
Common Tools for Web Scraping:
• BeautifulSoup: For parsing and extracting data from HTML and XML.
• Scrapy: A powerful framework for building web scrapers.
• Selenium: Used for scraping dynamic content by simulating a browser.
Example using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
url = "https://example.com"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.content, "html.parser")

# Extract title of the page
print(soup.title.text)
Applications:
• Extracting product prices from e-commerce websites.
• Gathering real estate data for price prediction.
Best Practices in Data Acquisition:
1. Respect Terms of Service: Ensure the data source allows automated access.
2. Handle Errors and Exceptions: Implement error handling when calling APIs or
scraping data.
3. Data Cleaning and Preprocessing: The acquired data may need cleaning
before analysis.
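
Point 2 might be implemented along the lines of the sketch below. The helper name fetch_json and its parameters are hypothetical, not part of the requests library; it is one possible pattern rather than a complete solution (it does no retrying or logging, for instance).

```python
import requests

def fetch_json(url, params=None, timeout=10):
    """Return the decoded JSON body, or None if the request fails."""
    try:
        response = requests.get(url, params=params, timeout=timeout)
        response.raise_for_status()   # treat 4xx/5xx responses as errors
        return response.json()
    except (requests.RequestException, ValueError) as exc:
        # RequestException covers connection, timeout and HTTP errors;
        # ValueError covers a body that is not valid JSON.
        print("request failed:", exc)
        return None

# A malformed URL fails before any network access, so the helper returns None.
result = fetch_json("not-a-valid-url")
print(result)
```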
