UNIT – 2
PYTHON FOR DATA ANALYTICS
Python is a general-purpose language and is often used for things other than data
analysis and data science. What makes Python so useful for working with data?
Its libraries give users the functionality they need when crunching data.
Below are the major Python libraries that are used for working with data. You should
take some time to familiarize yourself with the basic purposes of these packages.
NumPy
NumPy stands for Numerical Python. Its most powerful feature is the n-dimensional
array. The library also contains basic linear algebra functions, Fourier transforms,
advanced random number capabilities, and tools for integration with low-level
languages such as Fortran, C, and C++.
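As a brief illustration, here is a minimal sketch using the n-dimensional array together with NumPy's linear algebra and random-number utilities (the array values are arbitrary examples):
import numpy as np

# Create a 2-dimensional (2 x 3) array -- arbitrary example values
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)   # (2, 3)
print(a.mean())  # reductions work across the whole array

# Basic linear algebra: matrix product of a with its transpose
print(a @ a.T)

# Random number generation
rng = np.random.default_rng(42)
print(rng.normal(size=3))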
Pandas – Data Manipulation and Analysis
Pandas is used for structured data operations and manipulation. It is extensively used for
data munging and preparation. Pandas was added to the Python ecosystem relatively recently and
has been instrumental in boosting Python’s usage in the data science community.
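For instance, a DataFrame can be built directly from a Python dictionary; the column names and values below are purely illustrative:
import pandas as pd

# Build a small DataFrame from a dictionary of columns (illustrative data)
data = {'name': ['Asha', 'Ravi', 'Meena'], 'score': [82, 91, 77]}
df = pd.DataFrame(data)
print(df)
print(df['score'].mean())  # simple column-wise statistic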
Loading and manipulating data with Pandas DataFrames
Loading and manipulating data with Pandas DataFrames is a crucial step in data analysis with
Python. Here are some basic steps to load and manipulate data with Pandas DataFrames.
1. Loading data: You can load data into a Pandas DataFrame from various sources such as CSV
files, Excel files, SQL databases, and APIs. You can use
the read_csv(), read_excel(), read_sql(), and read_json() functions in
Pandas to read data from different sources.
2. Exploring data: Once you load data into a DataFrame, you can explore it using various
functions such as head(), tail(), describe(), info(), shape, columns,
and dtypes. These functions provide basic information about the DataFrame, such as the
column names, data types, and summary statistics.
3. Cleaning data: Data cleaning is an essential step in data analysis to ensure data quality. You
can clean data using various functions such as dropna(), fillna(), replace(),
and drop_duplicates(). These functions help you handle missing values, duplicate
rows, and inconsistent data.
4. Manipulating data: You can manipulate data in a DataFrame using functions such
as groupby(), pivot_table(), merge(), and concat(). These functions allow you
to group data, pivot tables, and combine data from multiple sources.
5. Visualizing data: You can use Pandas’ built-in visualization tools to create various plots such
as bar plots, line plots, scatter plots, and histograms. These plots help you visualize the data
and gain insights into data trends (a brief sketch appears after this list).
6. Exporting data: Once you analyze and manipulate data, you may need to export the results to
various file formats such as CSV, Excel, or SQL databases. You can use
the to_csv(), to_excel(), to_sql(), and to_json() functions in Pandas to export
data.
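A minimal sketch of step 5, assuming a small DataFrame with hypothetical 'region' and 'sales' columns (Pandas plotting uses Matplotlib under the hood):
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical example data
df = pd.DataFrame({'region': ['N', 'S', 'N', 'E'], 'sales': [120, 95, 130, 80]})

# Bar plot of total sales per region
df.groupby('region')['sales'].sum().plot(kind='bar')
plt.show()

# Histogram of the 'sales' column
df['sales'].plot(kind='hist')
plt.show()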
Loading and manipulating data using Pandas DataFrames in Python is a common task in data
analysis. Here's a brief overview of how to accomplish this:
Step 1: Install and Import Pandas
First, ensure you have Pandas installed. If not, you can install it using pip:
pip install pandas
Then, import Pandas in your Python script or notebook:
import pandas as pd
Step 2: Load Data
Pandas can read data from various file formats such as CSV, Excel, SQL, and more. The
most common method is reading from a CSV file:
df = pd.read_csv('path_to_your_file.csv')
For other formats:
# Excel
df = pd.read_excel('path_to_your_file.xlsx')
# SQL
from sqlalchemy import create_engine
engine = create_engine('your_database_connection_string')
df = pd.read_sql('your_table_name', engine)
Step 3: Basic DataFrame Operations
Once the data is loaded, you can perform various operations on the DataFrame.
Viewing Data
# Display the first few rows
print(df.head())
# Display summary information
print(df.info())
# Display basic statistics
print(df.describe())
Selecting Data
# Select a single column
col = df['column_name']
# Select multiple columns
cols = df[['column_name1', 'column_name2']]
# Select rows by index
row = df.iloc[0] # First row
rows = df.iloc[0:5] # First 5 rows
# Select rows based on condition
filtered_df = df[df['column_name'] > value]
Adding/Modifying Columns
# Add a new column
df['new_column'] = value
# Modify an existing column
df['existing_column'] = df['existing_column'].apply(lambda x: x + 1)
Handling Missing Data
# Check for missing values
print(df.isnull().sum())
# Drop rows with missing values
df = df.dropna()
# Fill missing values
df = df.fillna(value)
Grouping and Aggregation
# Group by a column and calculate the mean of each group
grouped_df = df.groupby('column_name').mean()
# Group by multiple columns and perform multiple aggregations
agg_df = df.groupby(['column1', 'column2']).agg({'column3': 'mean', 'column4': 'sum'})
Step 4: Save Data
After manipulation, you might want to save the DataFrame back to a file:
# Save to CSV
df.to_csv('path_to_save_file.csv', index=False)
# Save to Excel
df.to_excel('path_to_save_file.xlsx', index=False)
# Save to SQL
df.to_sql('your_table_name', engine, if_exists='replace', index=False)
Example: Complete Workflow
Here is a complete example that loads data, manipulates it, and saves the result:
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# Display first 5 rows
print(df.head())
# Filter rows where 'column1' > 10 (copy to avoid SettingWithCopyWarning)
filtered_df = df[df['column1'] > 10].copy()
# Add a new column 'new_column' which is 'column2' squared
filtered_df['new_column'] = filtered_df['column2'] ** 2
# Group by 'column1' and calculate the mean of 'new_column'
grouped_df = filtered_df.groupby('column1')['new_column'].mean().reset_index()
# Save the result to a new CSV file
grouped_df.to_csv('processed_data.csv', index=False)
What is Data Manipulation?
Data manipulation in data science is a crucial step that helps uncover patterns which ultimately support
informed decisions. Simply put, it refers to modifying, transforming, or reorganising data to extract
meaningful insights, prepare it for analysis, or meet specific requirements.
Key Features of Data Manipulation
Data Filtering:
Filtering is crucial for manipulating data and extracting pertinent insights from raw datasets. It
streamlines analysis by selectively isolating specific data points or patterns, which enhances
efficiency and ensures that only relevant information contributes to informed decision-making and
insightful discoveries.
Data Sorting:
Sorting data or structuring it into columns and rows enhances readability and comprehension.
Analysts can quickly identify patterns, outliers, and trends by organising data logically, streamlining
the analysis process. This structured presentation aids in extracting meaningful insights and making
informed decisions based on a clear understanding of the data.
Data Aggregation:
Aggregation, a vital data manipulation feature, condenses multiple records into a concise summary.
It encompasses computing averages, sums, counts, and identifying maximum or minimum values.
It streamlines analysis and yields actionable insights from complex datasets.
Example 1: Filtering and Sorting:
One fundamental data manipulation task is filtering and sorting. It involves selecting specific rows or
columns based on certain criteria and arranging the data in order. For instance, in a customer database,
you might filter the records to include only customers who purchased in the last month and then sort
them by their total spending.
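A minimal sketch of this example, assuming a hypothetical customer DataFrame with 'purchase_date' and 'total_spending' columns:
import pandas as pd

# Hypothetical customer records
customers = pd.DataFrame({
    'customer': ['A', 'B', 'C'],
    'purchase_date': pd.to_datetime(['2024-05-03', '2024-03-20', '2024-05-18']),
    'total_spending': [250.0, 90.0, 410.0]
})

# Filter: keep only customers who purchased on or after 1 May 2024 (stand-in for "the last month")
recent = customers[customers['purchase_date'] >= '2024-05-01']

# Sort: order the filtered customers by total spending, highest first
recent_sorted = recent.sort_values('total_spending', ascending=False)
print(recent_sorted)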
Example 2: Aggregation and Summarisation:
Another essential aspect of data manipulation is aggregating and summarising data. It involves
calculating summary statistics, such as a specific variable’s average, sum, minimum, or maximum
values. For instance, in sales data, you might aggregate the total revenue generated per product
category or calculate the monthly average sales.
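A minimal sketch of this example, assuming a hypothetical sales DataFrame with 'category', 'month', and 'revenue' columns:
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    'category': ['Books', 'Books', 'Toys', 'Toys'],
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'revenue': [1200, 1500, 800, 950]
})

# Total revenue per product category
print(sales.groupby('category')['revenue'].sum())

# Average revenue per month across categories
print(sales.groupby('month')['revenue'].mean())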
DATA CLEANING AND IMPORTING
Reading data from various sources such as CSV files, Excel files, and databases is a common
task in data analysis. Here is how you can accomplish this using Pandas in Python.
Step 1: Set Up the Environment
Make sure you have Pandas installed. If not, install it using pip:
pip install pandas
Step 2: Import Pandas
Import Pandas in your Python script or Jupyter Notebook:
import pandas as pd
Reading Data from CSV
To read data from a CSV file, use the pd.read_csv function:
# Reading data from a CSV file
csv_file_path = 'path_to_your_file.csv'
df_csv = pd.read_csv(csv_file_path)
print(df_csv.head())
Reading Data from Excel
To read data from an Excel file, use the pd.read_excel function:
# Reading data from an Excel file
excel_file_path = 'path_to_your_file.xlsx'
df_excel = pd.read_excel(excel_file_path, sheet_name='Sheet1')  # Specify sheet name if necessary
print(df_excel.head())
Reading Data from Databases
To read data from a database, you will need to use SQLAlchemy to establish a connection.
Install SQLAlchemy if you haven't:
pip install sqlalchemy
Then, use the create_engine function to establish a connection and pd.read_sql to read
the data:
from sqlalchemy import create_engine
# Replace with your actual database connection string
database_connection_string = 'dialect+driver://username:password@host:port/database'
# Create an engine
engine = create_engine(database_connection_string)
# Reading data from a SQL table
table_name = 'your_table_name'
df_sql = pd.read_sql(table_name, con=engine)
print(df_sql.head())
# Reading data using a SQL query
sql_query = 'SELECT * FROM your_table_name WHERE some_column > some_value'
df_sql_query = pd.read_sql(sql_query, con=engine)
print(df_sql_query.head())
Example: Combined Data Reading
Here's an example of reading data from CSV, Excel, and SQL sources in one script:
import pandas as pd
from sqlalchemy import create_engine
# Paths to your files
csv_file_path = 'path_to_your_file.csv'
excel_file_path = 'path_to_your_file.xlsx'
# Database connection string
database_connection_string = 'dialect+driver://username:password@host:port/database'
engine = create_engine(database_connection_string)
# Reading data from CSV
df_csv = pd.read_csv(csv_file_path)
print('CSV Data:')
print(df_csv.head())
# Reading data from Excel
df_excel = pd.read_excel(excel_file_path, sheet_name='Sheet1')
print('Excel Data:')
print(df_excel.head())
# Reading data from SQL
table_name = 'your_table_name'
df_sql = pd.read_sql(table_name, con=engine)
print('SQL Data:')
print(df_sql.head())
DATA CLEANING TECHNIQUES
Handling missing values
Handling missing values is a critical step in data preprocessing and can significantly impact the
performance of machine learning models. There are several strategies for dealing with missing data
(a short sketch follows this list):
1. Remove Missing Values
Row Deletion: Remove rows with missing values.
o Useful when the dataset is large and missing values are few.
Column Deletion: Remove columns with missing values.
o Appropriate when a column has a high percentage of missing values and is
less important.
2. Impute Missing Values
Mean/Median/Mode Imputation: Replace missing values with the mean (for
numerical data), median, or mode (for categorical data) of the column.
o Simple and fast, but can distort the data distribution.
Forward/Backward Fill: Replace missing values with the previous/next value in the
column.
o Suitable for time series data.
Interpolation: Use linear or polynomial interpolation for numerical data.
K-Nearest Neighbors (KNN) Imputation: Replace missing values with the average
value of the nearest neighbors.
o Takes into account the similarity between data points but is computationally
expensive.
Multivariate Imputation by Chained Equations (MICE): Uses the relationships
between features to predict missing values.
o More sophisticated and accurate but also computationally intensive.
3. Use Algorithms that Support Missing Values
Some machine learning algorithms can handle missing values natively, such as
decision trees, Random Forests, and XGBoost.
4. Flag and Fill
Create a new binary column to indicate whether the value was missing and then fill
the missing value using one of the imputation methods.
Preserves information about missingness, which can be useful for some models.
5. Prediction Models
Use machine learning models to predict and impute missing values based on other
features.
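A minimal sketch of a few of the simpler strategies above, on a toy DataFrame with hypothetical 'age' and 'income' columns (the KNN imputer comes from scikit-learn):
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with missing values (hypothetical)
df = pd.DataFrame({'age': [25, np.nan, 40, 31],
                   'income': [50000, 62000, np.nan, 58000]})

# 1. Row deletion: drop rows that contain any missing value
dropped = df.dropna()

# 2. Mean imputation for a numerical column
df['age_filled'] = df['age'].fillna(df['age'].mean())

# 4. Flag and fill: record missingness, then impute with the median
df['income_was_missing'] = df['income'].isna()
df['income'] = df['income'].fillna(df['income'].median())

# 2 (KNN): impute 'age' from the average of its nearest neighbours
imputer = KNNImputer(n_neighbors=2)
df[['age', 'income']] = imputer.fit_transform(df[['age', 'income']])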
HANDLING DUPLICATES:
Dealing with duplicates is a crucial aspect of data cleaning to ensure the quality and
reliability of the dataset. Here are some techniques to handle duplicates effectively (a short sketch
follows the list):
1. Identifying Duplicates
Exact Duplicates: Rows that are completely identical across all columns.
Partial Duplicates: Rows that are identical in specific key columns but may have different
values in other columns.
2. Removing Duplicates
Remove Exact Duplicates: Use functions to identify and remove rows that are exact
duplicates.
3. Handling Partial Duplicates
Prioritize Based on a Column: Keep the duplicate row based on the value of a particular
column (e.g., latest timestamp).
Aggregation: Aggregate duplicate rows by summarizing or averaging numerical values.
Manual Review: Sometimes manual review is necessary for critical data.
4. Combining Information from Duplicates
Merge Information: Combine information from duplicate rows, especially if different rows
have different useful information.
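A minimal sketch of these techniques, assuming a hypothetical DataFrame with 'id', 'value', and 'timestamp' columns:
import pandas as pd

# Hypothetical records containing exact and partial duplicates
df = pd.DataFrame({
    'id': [1, 1, 2, 3, 3],
    'value': [10, 10, 20, 30, 35],
    'timestamp': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-01-02',
                                 '2024-01-03', '2024-01-05'])
})

# Identify exact duplicates
print(df.duplicated().sum())

# Remove exact duplicates
df_exact = df.drop_duplicates()

# Partial duplicates: keep the latest row per 'id' (prioritise by timestamp)
df_latest = df.sort_values('timestamp').drop_duplicates(subset='id', keep='last')

# Aggregation: average the 'value' of rows sharing the same 'id'
df_agg = df.groupby('id', as_index=False)['value'].mean()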
HANDLING OUTLIERS:
Handling outliers is an essential part of data preprocessing, as outliers can significantly
impact the performance of statistical analyses and machine learning models. Here are several
techniques to identify and handle outliers:
Identifying Outliers
1. Statistical Methods
o Z-Score: Measures the number of standard deviations a data point is from the
mean.
from scipy.stats import zscore
import numpy as np
df['z_score'] = zscore(df['column_name'])
outliers = df[np.abs(df['z_score']) > 3]
o IQR (Interquartile Range): Identifies outliers based on the range between
the first quartile (Q1) and the third quartile (Q3).
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)]
2. Visual Methods
o Box Plot: Displays the distribution of data and highlights outliers.
import matplotlib.pyplot as plt
df['column_name'].plot.box()
plt.show()
o Scatter Plot: Useful for identifying outliers in a two-dimensional dataset.
df.plot.scatter(x='column_x', y='column_y')
plt.show()
3. Model-Based Methods
o Isolation Forest: Identifies anomalies by isolating observations.
from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(contamination=0.1)
df['anomaly_score'] = iso_forest.fit_predict(df[['column_name']])
outliers = df[df['anomaly_score'] == -1]
Handling Outliers
1. Removing Outliers
o Simply remove the outliers identified by statistical or visual methods.
df_cleaned = df[(df['column_name'] >= lower_bound) & (df['column_name'] <= upper_bound)]
2. Transforming Data
o Log Transformation: Reduces the impact of outliers.
df['log_column'] = np.log(df['column_name'] + 1)  # Add 1 to avoid log(0)
o Square Root Transformation: Similar to log transformation, but less
aggressive.
df['sqrt_column'] = np.sqrt(df['column_name'])
3. Capping/Flooring
o Limit the values of outliers to the nearest threshold (e.g., the upper or lower bound
of IQR).
df['column_name'] = np.where(df['column_name'] > upper_bound, upper_bound,
                             np.where(df['column_name'] < lower_bound, lower_bound, df['column_name']))
4. Imputation
o Replace outliers with statistical measures like mean or median.
median = df['column_name'].median()
df['column_name'] = np.where((df['column_name'] < lower_bound) | (df['column_name'] > upper_bound),
                             median, df['column_name'])
5. Model-Based Methods
o Use robust algorithms that are less sensitive to outliers, such as decision trees or
robust regression models.