UNIT – 2
PYTHON FOR DATA ANALYTICS
Python is a general-purpose language and is often used for things other than data
analysis and data science. What makes Python so useful for working with data?
Its libraries give users the functionality they need when crunching data.
Below are the major Python libraries that are used for working with data. You should
take some time to familiarize yourself with the basic purposes of these packages.
NumPy
NumPy stands for Numerical Python. Its most powerful feature is the n-dimensional
array. The library also contains basic linear algebra functions, Fourier transforms,
advanced random number capabilities, and tools for integration with low-level
languages such as Fortran, C, and C++.
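As a brief illustration, here is a minimal sketch using the n-dimensional array together with NumPy's linear algebra and random-number utilities (the array values are arbitrary examples):
import numpy as np

# Create a 2-dimensional (2 x 3) array -- arbitrary example values
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)   # (2, 3)
print(a.mean())  # reductions work across the whole array

# Basic linear algebra: matrix product of a with its transpose
print(a @ a.T)

# Random number generation
rng = np.random.default_rng(42)
print(rng.normal(size=3))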
Pandas – Data Manipulation and Analysis
Pandas is used for structured data operations and manipulation. It is extensively used for
data munging and preparation. Pandas was added to the Python ecosystem relatively recently and
has been instrumental in boosting Python’s usage in the data science community.
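For instance, a DataFrame can be built directly from a Python dictionary; the column names and values below are purely illustrative:
import pandas as pd

# Build a small DataFrame from a dictionary of columns (illustrative data)
data = {'name': ['Asha', 'Ravi', 'Meena'], 'score': [82, 91, 77]}
df = pd.DataFrame(data)
print(df)
print(df['score'].mean())  # simple column-wise statistic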
Loading and manipulating data with Pandas DataFrames
Loading and manipulating data with Pandas DataFrames is a crucial step in data analysis with
Python. Here are some basic steps to load and manipulate data with Pandas DataFrames.
1. Loading data: You can load data into a Pandas DataFrame from various sources such as CSV
files, Excel files, SQL databases, and APIs. You can use
the read_csv(), read_excel(), read_sql(), and read_json() functions in
Pandas to read data from different sources.
2. Exploring data: Once you load data into a DataFrame, you can explore it using various
functions such as head(), tail(), describe(), info(), shape, columns,
and dtypes. These functions provide basic information about the DataFrame, such as the
column names, data types, and summary statistics.
3. Cleaning data: Data cleaning is an essential step in data analysis to ensure data quality. You
can clean data using various functions such as dropna(), fillna(), replace(),
and drop_duplicates(). These functions help you handle missing values, duplicate
rows, and inconsistent data.
4. Manipulating data: You can manipulate data in a DataFrame using functions such
as groupby(), pivot_table(), merge(), and concat(). These functions allow you
to group data, pivot tables, and combine data from multiple sources.
5. Visualizing data: You can use Pandas’ built-in visualization tools to create various plots such
as bar plots, line plots, scatter plots, and histograms. These plots help you visualize the data
and gain insights into data trends (a brief sketch appears after this list).
6. Exporting data: Once you analyze and manipulate data, you may need to export the results to
various file formats such as CSV, Excel, or SQL databases. You can use
the to_csv(), to_excel(), to_sql(), and to_json() functions in Pandas to export
data.
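A minimal sketch of step 5, assuming a small DataFrame with hypothetical 'region' and 'sales' columns (Pandas plotting uses Matplotlib under the hood):
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical example data
df = pd.DataFrame({'region': ['N', 'S', 'N', 'E'], 'sales': [120, 95, 130, 80]})

# Bar plot of total sales per region
df.groupby('region')['sales'].sum().plot(kind='bar')
plt.show()

# Histogram of the 'sales' column
df['sales'].plot(kind='hist')
plt.show()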
Loading and manipulating data using Pandas DataFrames in Python is a common task in data
analysis. Here's a brief overview of how to accomplish this:
Step 1: Install and Import Pandas
First, ensure you have Pandas installed. If not, you can install it using pip:
pip install pandas
Then, import Pandas in your Python script or notebook:
import pandas as pd
Step 2: Load Data
Pandas can read data from various file formats such as CSV, Excel, SQL, and more. The
most common method is reading from a CSV file:
df = pd.read_csv('path_to_your_file.csv')
For other formats:
# Excel
df = pd.read_excel('path_to_your_file.xlsx')
# SQL
from sqlalchemy import create_engine
engine = create_engine('your_database_connection_string')
df = pd.read_sql('your_table_name', engine)
Step 3: Basic DataFrame Operations
Once the data is loaded, you can perform various operations on the DataFrame.
Viewing Data
# Display the first few rows
print(df.head())
# Display summary information
print(df.info())
# Display basic statistics
print(df.describe())
Selecting Data
# Select a single column
col = df['column_name']
# Select multiple columns
cols = df[['column_name1', 'column_name2']]
# Select rows by index
row = df.iloc[0] # First row
rows = df.iloc[0:5] # First 5 rows
# Select rows based on condition
filtered_df = df[df['column_name'] > value]
Adding/Modifying Columns
# Add a new column
df['new_column'] = value
# Modify an existing column
df['existing_column'] = df['existing_column'].apply(lambda x: x + 1)
Handling Missing Data
# Check for missing values
print(df.isnull().sum())
# Drop rows with missing values
df = df.dropna()
# Fill missing values
df = df.fillna(value)
Grouping and Aggregation
# Group by a column and calculate the mean of each group
grouped_df = df.groupby('column_name').mean()
# Group by multiple columns and perform multiple aggregations
agg_df = df.groupby(['column1', 'column2']).agg({'column3': 'mean', 'column4': 'sum'})
Step 4: Save Data
After manipulation, you might want to save the DataFrame back to a file:
# Save to CSV
df.to_csv('path_to_save_file.csv', index=False)
# Save to Excel
df.to_excel('path_to_save_file.xlsx', index=False)
# Save to SQL
df.to_sql('your_table_name', engine, if_exists='replace', index=False)
Example: Complete Workflow
Here is a complete example that loads data, manipulates it, and saves the result:
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# Display first 5 rows
print(df.head())
# Filter rows where 'column1' > 10 (copy to avoid SettingWithCopyWarning)
filtered_df = df[df['column1'] > 10].copy()
# Add a new column 'new_column' which is 'column2' squared
filtered_df['new_column'] = filtered_df['column2'] ** 2
# Group by 'column1' and calculate the mean of 'new_column'
grouped_df = filtered_df.groupby('column1')['new_column'].mean().reset_index()
# Save the result to a new CSV file
grouped_df.to_csv('processed_data.csv', index=False)
What is Data Manipulation?
Data manipulation in data science is a crucial step that helps uncover patterns which ultimately support
informed decisions. Simply put, it refers to modifying, transforming, or reorganising data to extract
meaningful insights, prepare it for analysis, or meet specific requirements.
Key Features of Data Manipulation
Data Filtering:
Filtering is crucial for manipulating data and extracting pertinent insights from raw datasets. It
streamlines analysis by selectively isolating specific data points or patterns, which enhances
efficiency and ensures that only relevant information contributes to informed decision-making and
insightful discoveries.
Data Sorting:
Sorting data or structuring it into columns and rows enhances readability and comprehension.
Analysts can quickly identify patterns, outliers, and trends by organising data logically, streamlining
the analysis process. This structured presentation aids in extracting meaningful insights and making
informed decisions based on a clear understanding of the data.
Data Aggregation:
Aggregation, a vital data manipulation feature, condenses multiple records into a concise summary.
It encompasses computing averages, sums, counts, and identifying maximum or minimum values.
It streamlines analysis and yields actionable insights from complex datasets.
Example 1: Filtering and Sorting:
One fundamental data manipulation task is filtering and sorting. It involves selecting specific rows or
columns based on certain criteria and arranging the data in order. For instance, in a customer database,
you might filter the records to include only customers who purchased in the last month and then sort
them by their total spending.
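A minimal sketch of this example, assuming a hypothetical customer DataFrame with 'purchase_date' and 'total_spending' columns:
import pandas as pd

# Hypothetical customer records
customers = pd.DataFrame({
    'customer': ['A', 'B', 'C'],
    'purchase_date': pd.to_datetime(['2024-05-03', '2024-03-20', '2024-05-18']),
    'total_spending': [250.0, 90.0, 410.0]
})

# Filter: keep only customers who purchased on or after 1 May 2024 (stand-in for "the last month")
recent = customers[customers['purchase_date'] >= '2024-05-01']

# Sort: order the filtered customers by total spending, highest first
recent_sorted = recent.sort_values('total_spending', ascending=False)
print(recent_sorted)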
Example 2: Aggregation and Summarisation:
Another essential aspect of data manipulation is aggregating and summarising data. It involves
calculating summary statistics, such as a specific variable’s average, sum, minimum, or maximum
values. For instance, in sales data, you might aggregate the total revenue generated per product
category or calculate the monthly average sales.
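A minimal sketch of this example, assuming a hypothetical sales DataFrame with 'category', 'month', and 'revenue' columns:
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    'category': ['Books', 'Books', 'Toys', 'Toys'],
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'revenue': [1200, 1500, 800, 950]
})

# Total revenue per product category
print(sales.groupby('category')['revenue'].sum())

# Average revenue per month across categories
print(sales.groupby('month')['revenue'].mean())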
DATA CLEANING AND IMPORTING
Reading data from various sources such as CSV files, Excel files, and databases is a common
task in data analysis. Here is how you can accomplish this using Pandas in Python.
Step 1: Set Up the Environment
Make sure you have Pandas installed. If not, install it using pip:
pip install pandas
Step 2: Import Pandas
Import Pandas in your Python script or Jupyter Notebook:
import pandas as pd
Reading Data from CSV
To read data from a CSV file, use the pd.read_csv function:
# Reading data from a CSV file
csv_file_path = 'path_to_your_file.csv'
df_csv = pd.read_csv(csv_file_path)
print(df_csv.head())
Reading Data from Excel
To read data from an Excel file, use the pd.read_excel function:
# Reading data from an Excel file
excel_file_path = 'path_to_your_file.xlsx'
df_excel = pd.read_excel(excel_file_path, sheet_name='Sheet1')  # Specify sheet name if necessary
print(df_excel.head())
Reading Data from Databases
To read data from a database, you will need to use SQLAlchemy to establish a connection.
Install SQLAlchemy if you haven't:
pip install sqlalchemy
Then, use the create_engine function to establish a connection and pd.read_sql to read
the data:
from sqlalchemy import create_engine
# Replace with your actual database connection string
database_connection_string = 'dialect+driver://username:password@host:port/database'
# Create an engine
engine = create_engine(database_connection_string)
# Reading data from a SQL table
table_name = 'your_table_name'
df_sql = pd.read_sql(table_name, con=engine)
print(df_sql.head())
# Reading data using a SQL query
sql_query = 'SELECT * FROM your_table_name WHERE some_column > some_value'
df_sql_query = pd.read_sql(sql_query, con=engine)
print(df_sql_query.head())
Example: Combined Data Reading
Here's an example of reading data from CSV, Excel, and SQL sources in one script:
import pandas as pd
from sqlalchemy import create_engine
# Paths to your files
csv_file_path = 'path_to_your_file.csv'
excel_file_path = 'path_to_your_file.xlsx'
# Database connection string
database_connection_string = 'dialect+driver://username:password@host:port/database'
engine = create_engine(database_connection_string)
# Reading data from CSV
df_csv = pd.read_csv(csv_file_path)
print('CSV Data:')
print(df_csv.head())
# Reading data from Excel
df_excel = pd.read_excel(excel_file_path, sheet_name='Sheet1')
print('Excel Data:')
print(df_excel.head())
# Reading data from SQL
table_name = 'your_table_name'
df_sql = pd.read_sql(table_name, con=engine)
print('SQL Data:')
print(df_sql.head())
DATA CLEANING TECHNIQUES
Handling missing values
Handling missing values is a critical step in data preprocessing and can significantly impact the
performance of machine learning models. There are several strategies for dealing with missing data
(a short sketch follows this list):
1. Remove Missing Values
Row Deletion: Remove rows with missing values.
o Useful when the dataset is large and missing values are few.
Column Deletion: Remove columns with missing values.
o Appropriate when a column has a high percentage of missing values and is
less important.
2. Impute Missing Values
Mean/Median/Mode Imputation: Replace missing values with the mean (for
numerical data), median, or mode (for categorical data) of the column.
o Simple and fast, but can distort the data distribution.
Forward/Backward Fill: Replace missing values with the previous/next value in the
column.
o Suitable for time series data.
Interpolation: Use linear or polynomial interpolation for numerical data.
K-Nearest Neighbors (KNN) Imputation: Replace missing values with the average
value of the nearest neighbors.
o Takes into account the similarity between data points but is computationally
expensive.
Multivariate Imputation by Chained Equations (MICE): Uses the relationships
between features to predict missing values.
o More sophisticated and accurate but also computationally intensive.
3. Use Algorithms that Support Missing Values
Some machine learning algorithms can handle missing values natively, such as
decision trees, Random Forests, and XGBoost.
4. Flag and Fill
Create a new binary column to indicate whether the value was missing and then fill
the missing value using one of the imputation methods.
Preserves information about missingness, which can be useful for some models.
5. Prediction Models
Use machine learning models to predict and impute missing values based on other
features.
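A minimal sketch of a few of the simpler strategies above, on a toy DataFrame with hypothetical 'age' and 'income' columns (the KNN imputer comes from scikit-learn):
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with missing values (hypothetical)
df = pd.DataFrame({'age': [25, np.nan, 40, 31],
                   'income': [50000, 62000, np.nan, 58000]})

# 1. Row deletion: drop rows that contain any missing value
dropped = df.dropna()

# 2. Mean imputation for a numerical column
df['age_filled'] = df['age'].fillna(df['age'].mean())

# 4. Flag and fill: record missingness, then impute with the median
df['income_was_missing'] = df['income'].isna()
df['income'] = df['income'].fillna(df['income'].median())

# 2 (KNN): impute 'age' from the average of its nearest neighbours
imputer = KNNImputer(n_neighbors=2)
df[['age', 'income']] = imputer.fit_transform(df[['age', 'income']])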
HANDLING DUPLICATES:
Dealing with duplicates is a crucial aspect of data cleaning to ensure the quality and
reliability of the dataset. Here are some techniques to handle duplicates effectively (a short sketch
follows the list):
1. Identifying Duplicates
Exact Duplicates: Rows that are completely identical across all columns.
Partial Duplicates: Rows that are identical in specific key columns but may have different
values in other columns.
2. Removing Duplicates
Remove Exact Duplicates: Use functions to identify and remove rows that are exact
duplicates.
3. Handling Partial Duplicates
Prioritize Based on a Column: Keep the duplicate row based on the value of a particular
column (e.g., latest timestamp).
Aggregation: Aggregate duplicate rows by summarizing or averaging numerical values.
Manual Review: Sometimes manual review is necessary for critical data.
4. Combining Information from Duplicates
Merge Information: Combine information from duplicate rows, especially if different rows
have different useful information.
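A minimal sketch of these techniques, assuming a hypothetical DataFrame with 'id', 'value', and 'timestamp' columns:
import pandas as pd

# Hypothetical records containing exact and partial duplicates
df = pd.DataFrame({
    'id': [1, 1, 2, 3, 3],
    'value': [10, 10, 20, 30, 35],
    'timestamp': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-01-02',
                                 '2024-01-03', '2024-01-05'])
})

# Identify exact duplicates
print(df.duplicated().sum())

# Remove exact duplicates
df_exact = df.drop_duplicates()

# Partial duplicates: keep the latest row per 'id' (prioritise by timestamp)
df_latest = df.sort_values('timestamp').drop_duplicates(subset='id', keep='last')

# Aggregation: average the 'value' of rows sharing the same 'id'
df_agg = df.groupby('id', as_index=False)['value'].mean()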
HANDLING OUTLIERS:
Handling outliers is an essential part of data preprocessing, as outliers can significantly
impact the performance of statistical analyses and machine learning models. Here are several
techniques to identify and handle outliers:
Identifying Outliers
1. Statistical Methods
o Z-Score: Measures the number of standard deviations a data point is from the
mean.
from scipy.stats import zscore
import numpy as np
df['z_score'] = zscore(df['column_name'])
outliers = df[np.abs(df['z_score']) > 3]
o IQR (Interquartile Range): Identifies outliers based on the range between
the first quartile (Q1) and the third quartile (Q3).
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)]
2. Visual Methods
o Box Plot: Displays the distribution of data and highlights outliers.
import matplotlib.pyplot as plt
df['column_name'].plot.box()
plt.show()
o Scatter Plot: Useful for identifying outliers in a two-dimensional dataset.
df.plot.scatter(x='column_x', y='column_y')
plt.show()
3. Model-Based Methods
o Isolation Forest: Identifies anomalies by isolating observations.
from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(contamination=0.1)
df['anomaly_score'] = iso_forest.fit_predict(df[['column_name']])
outliers = df[df['anomaly_score'] == -1]
Handling Outliers
1. Removing Outliers
o Simply remove the outliers identified by statistical or visual methods.
df_cleaned = df[(df['column_name'] >= lower_bound) & (df['column_name'] <= upper_bound)]
2. Transforming Data
o Log Transformation: Reduces the impact of outliers.
df['log_column'] = np.log(df['column_name'] + 1)  # Add 1 to avoid log(0)
o Square Root Transformation: Similar to log transformation, but less
aggressive.
df['sqrt_column'] = np.sqrt(df['column_name'])
3. Capping/Flooring
o Limit the values of outliers to the nearest threshold (e.g., the upper or lower bound
of IQR).
df['column_name'] = np.where(df['column_name'] > upper_bound, upper_bound,
                             np.where(df['column_name'] < lower_bound, lower_bound, df['column_name']))
4. Imputation
o Replace outliers with statistical measures like mean or median.
median = df['column_name'].median()
df['column_name'] = np.where((df['column_name'] < lower_bound) | (df['column_name'] > upper_bound),
                             median, df['column_name'])
5. Model-Based Methods
o Use robust algorithms that are less sensitive to outliers, such as decision trees or
robust regression models.