ml report

The dataset 'International Sale Report.csv' contains details of international sales transactions, including attributes such as date, customer, SKU, quantity, rate, and gross amount, and is intended for exploratory data analysis. It is sourced from Kaggle and is structured for use with libraries like pandas, numpy, and seaborn for data manipulation and visualization. The document outlines data preprocessing steps, statistical measurements, outlier detection, and various visualization techniques to analyze sales trends and customer behavior.

Uploaded by

mohammedraiyanrs


Dataset Name: International sale Report.csv

About the Dataset:

The dataset contains details of international sales transactions, including attributes like DATE, CUSTOMER,
SKU, PCS (quantity), RATE, and GROSS AMT (total sale value). This dataset is valuable for understanding sales
trends, detecting anomalies, and analyzing customer behavior.
The dataset is unlabeled: it has no defined target variable, so it is used for exploratory data
analysis and visualization rather than prediction.

Source of the Dataset:

Kaggle is a free online platform owned by Google where people can find datasets, build machine learning
models, and share data science projects. It is widely used by beginners and professionals to learn, practice,
and compete in data-related challenges. The file International sale Report.csv is a structured (tabular)
dataset hosted there.

URL: https://www.kaggle.com/datasets/thedevastator/unlock-profits-with-e-commerce-sales-data?select=International+sale+Report.csv

Data Preprocessing:
1. Importing Required Libraries:

• pandas: Used for data loading, cleaning, and manipulation in tabular format.
• numpy: Provides support for efficient numerical and array-based operations.
• scipy.stats: Offers statistical tools like zscore for normalization and gmean for geometric mean.
• matplotlib.pyplot: Enables creation of basic data visualizations like line and bar charts.
• seaborn: Simplifies the creation of visually appealing and informative statistical plots.

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore, gmean

2. Loading the Dataset:

The dataset is loaded into a pandas DataFrame and the first few rows are previewed.
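
A minimal sketch of this step is given here. Since the real file is not bundled with this report, a small in-memory CSV with invented rows stands in for it; only the column names (DATE, CUSTOMER, SKU, PCS, RATE, GROSS AMT) come from the dataset description, and the real file may contain further columns.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'International sale Report.csv' (all values invented)
sample_csv = io.StringIO(
    "DATE,CUSTOMER,SKU,PCS,RATE,GROSS AMT\n"
    "06-05-21,CUSTOMER A,SKU-001,2,616.56,1233.12\n"
    "06-05-21,CUSTOMER A,SKU-002,1,550.00,550.00\n"
    "07-05-21,CUSTOMER B,SKU-001,3,616.56,1849.68\n"
)

# With the real file this would be: pd.read_csv("International sale Report.csv")
df = pd.read_csv(sample_csv)

# head() returns the first rows (five by default, fewer if the frame is shorter)
print(df.head())
```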
output:

1. Dataset Dimensions Overview & Display Attribute Names:

• The code uses the .shape attribute of the DataFrame, which returns a tuple holding the number of
rows and columns in the dataset, giving a quick view of its overall size.
• The attributes (column names) of the CSV file are displayed using the pandas.DataFrame.columns
property to understand which fields the dataset contains.
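
The two properties can be sketched as follows. The small DataFrame is an invented example, not the real sales data.

```python
import pandas as pd

# Invented example frame with three of the dataset's columns
df = pd.DataFrame({
    "CUSTOMER": ["A", "B", "C", "A"],
    "PCS": [2, 1, 3, 5],
    "RATE": [100.0, 250.0, 80.0, 100.0],
})

n_rows, n_cols = df.shape          # shape is a (rows, columns) tuple
column_names = list(df.columns)    # columns is an Index; list() makes it printable

print(f"Rows: {n_rows}, Columns: {n_cols}")
print("Attributes:", column_names)
```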

output:

2. Removing Duplicate Entries & Handling Missing Values:

Duplicate rows are removed from the dataset to maintain data accuracy and avoid redundancy in analysis.
To ensure clean and reliable data, null values in the dataset are identified using isnull().sum(). This prevents
errors in further analysis and improves model performance.
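
A short sketch of both steps on invented data. The text only says that nulls are identified; dropping the affected rows with dropna() is shown as one possible follow-up and is an assumption, not necessarily what the report did.

```python
import numpy as np
import pandas as pd

# Invented frame: row 1 duplicates row 0, and two cells are missing
df = pd.DataFrame({
    "CUSTOMER": ["A", "A", "B", "C"],
    "PCS": [2, 2, 1, np.nan],
    "RATE": [100.0, 100.0, np.nan, 80.0],
})

deduped = df.drop_duplicates()         # keeps the first copy of each repeated row
null_counts = deduped.isnull().sum()   # number of missing values per column
cleaned = deduped.dropna()             # assumption: drop rows containing any null

print(null_counts)
```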

Output: Displays the total number of null values present in the dataset

3. Statistical Measurements:

• Mean: The average value of the selected column, which gives a general idea of the central tendency of the
data.
• Median: The middle value when the data is sorted, providing a better measure of central tendency when
the data has outliers.
• Mode: The most frequently occurring value in the column, indicating the most common value across
transactions.
• Standard Deviation: Shows how spread out the values are around the mean; a higher standard deviation
indicates more variability in the selected column.
• Min and Max: Represent the lowest and highest values in the column, helping to understand the range of
the data.
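
The measures above can be computed with pandas as in this sketch. The series values are invented, and the name col_data is borrowed from the geometric mean snippet in this report.

```python
import pandas as pd

# Invented values standing in for one numeric column
col_data = pd.Series([10.0, 20.0, 20.0, 30.0, 70.0], name="GROSS AMT")

stats = {
    "mean": col_data.mean(),
    "median": col_data.median(),
    "mode": col_data.mode().iloc[0],  # mode() can return several values; take the first
    "std": col_data.std(),            # pandas default is the sample std (ddof=1)
    "min": col_data.min(),
    "max": col_data.max(),
}

for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Here the single large value (70) pulls the mean above the median, which is the situation in which the median is the more robust summary.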

output:

4. Z-Score Calculation:

Z-score measures how many standard deviations a data point is from the mean. It helps in identifying
outliers in the data: values with a Z-score above 3 or below -3 are typically considered unusual.
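
As a sketch, the Z-score of each value x is (x - mean) / standard deviation. The example below computes it by hand with numpy on invented values; scipy.stats.zscore gives the same result with its default settings, which use the population standard deviation (ddof=0).

```python
import numpy as np

# Invented values standing in for one numeric column
values = np.array([10.0, 12.0, 11.0, 13.0, 9.0])

mean = values.mean()
std = values.std()                 # population std (ddof=0), as in scipy.stats.zscore
z_scores = (values - mean) / std   # standard deviations away from the mean

print(np.round(z_scores, 3))
```

By construction the resulting scores have mean 0 and standard deviation 1.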

output:
5. Outlier Detection using Z-score:
Outliers are data points that deviate significantly from other observations and lie far from the
mean. They can arise due to variability in data, measurement errors, or unusual conditions.
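
A minimal sketch of Z-score based outlier detection with the usual threshold of 3, on invented quantities. One caveat worth knowing: with the population standard deviation, the largest possible |Z| in a sample of n points is (n - 1) / sqrt(n), so no value can exceed 3 when there are fewer than 11 points.

```python
import pandas as pd

# Invented quantities: twenty ordinary values and one extreme value
pcs = pd.Series([98, 102, 100, 101, 99] * 4 + [1000], name="PCS", dtype=float)

z = (pcs - pcs.mean()) / pcs.std(ddof=0)   # Z-score of every row
outliers = pcs[z.abs() > 3]                # rows more than 3 std from the mean

print(outliers)
```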

6. Geometric Mean:
The geometric mean is a type of average that is especially useful for datasets with values that are
multiplicative or skewed, such as rates or percentages.

# Geometric Mean (only for positive values)
# col_data is assumed to be the selected numeric column and column its name
if (col_data > 0).all():
    print(f"\nGeometric Mean of '{column}': {gmean(col_data):.2f}")
else:
    print("\nGeometric Mean: Not applicable (contains zero or negative values)")
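
For intuition, the geometric mean of n positive values is the n-th root of their product, which equals the exponential of the mean of their logarithms. The sketch below checks this on invented values with numpy; scipy.stats.gmean computes the same quantity.

```python
import numpy as np

# Invented positive values that grow multiplicatively (each is 3x the previous)
values = np.array([1.0, 3.0, 9.0])

geometric_mean = float(np.exp(np.log(values).mean()))   # exp(mean(log(x)))
arithmetic_mean = float(values.mean())

print(f"Geometric mean:  {geometric_mean:.2f}")
print(f"Arithmetic mean: {arithmetic_mean:.2f}")
```

For positive values that are not all equal, the geometric mean is always below the arithmetic mean, so it is less affected by a few large values.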

output:

Data Visualization:
1. Scatter Plot:

A scatter plot is a type of data visualization used to show the relationship between two numerical variables.
Each point on the scatter plot represents an individual data entry.
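
A minimal sketch of a scatter plot with matplotlib. The choice of PCS against GROSS AMT and the values themselves are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to get an interactive window
import matplotlib.pyplot as plt
import pandas as pd

# Invented values for two numeric columns
df = pd.DataFrame({
    "PCS": [1, 2, 3, 4, 5],
    "GROSS AMT": [120.0, 260.0, 330.0, 490.0, 610.0],
})

fig, ax = plt.subplots()
ax.scatter(df["PCS"], df["GROSS AMT"])   # one point per row
ax.set_xlabel("PCS")
ax.set_ylabel("GROSS AMT")
ax.set_title("PCS vs GROSS AMT")
# plt.show() displays the figure when an interactive backend is used
```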

1. Pearson Correlation Coefficient & Covariance Matrix:

The Pearson correlation coefficient is a statistical measure that indicates the strength and direction of the
linear relationship between two numerical variables. It ranges from -1 to +1.

The covariance matrix is a square matrix that displays the covariance values between pairs of variables in a
dataset. It helps to understand how two variables change together. A positive covariance indicates that the
variables tend to increase or decrease together, while a negative covariance indicates that one tends to
increase as the other decreases.
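
Both matrices are available directly in pandas, as this sketch on invented data shows. It also checks the relation between the two: the Pearson coefficient is the covariance divided by the product of the two standard deviations.

```python
import pandas as pd

# Invented rows; GROSS AMT is PCS * RATE
df = pd.DataFrame({
    "PCS": [1, 2, 3, 4],
    "RATE": [5.0, 4.0, 3.0, 2.0],
    "GROSS AMT": [5.0, 8.0, 9.0, 8.0],
})

corr = df.corr(method="pearson")   # pairwise Pearson coefficients, each in [-1, 1]
cov = df.cov()                     # pairwise sample covariances (divides by n - 1)

# Pearson r recomputed from the covariance and the sample standard deviations
r_manual = cov.loc["PCS", "GROSS AMT"] / (df["PCS"].std() * df["GROSS AMT"].std())

print(corr.round(3))
print(cov.round(3))
```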
2. Heatmap:

A heatmap is a data visualization technique that uses color to represent the intensity or frequency of values
in a matrix or 2D grid. The most common use cases include:
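
As a hedged illustration, one widely used application is drawing a correlation matrix as a heatmap with seaborn. The data is invented, and the colour map and value limits are choices made for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to get an interactive window
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented rows; GROSS AMT is PCS * RATE
df = pd.DataFrame({
    "PCS": [1, 2, 3, 4],
    "RATE": [5.0, 4.0, 3.0, 2.0],
    "GROSS AMT": [5.0, 8.0, 9.0, 8.0],
})
corr = df.corr()

fig, ax = plt.subplots()
# annot=True writes each coefficient into its cell; vmin/vmax pin the colour scale
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap")
```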

3. Bar Plot:
A bar plot is a graphical representation of data where rectangular bars represent the data values. The
length of each bar corresponds to the magnitude or frequency of the variable being represented. It's
commonly used for comparing different categories or tracking changes over time.
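
A minimal sketch of a bar plot comparing categories, here the total GROSS AMT per customer. The grouping choice and the values are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to get an interactive window
import matplotlib.pyplot as plt
import pandas as pd

# Invented transactions
df = pd.DataFrame({
    "CUSTOMER": ["A", "B", "A", "C", "C"],
    "GROSS AMT": [100.0, 200.0, 300.0, 150.0, 50.0],
})

totals = df.groupby("CUSTOMER")["GROSS AMT"].sum()   # one total per customer

fig, ax = plt.subplots()
ax.bar(totals.index, totals.values)   # bar height = total sales of that customer
ax.set_xlabel("CUSTOMER")
ax.set_ylabel("Total GROSS AMT")
```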
4. Pie Chart:
A pie chart is used to show the proportion of categories within a whole. Each segment of the pie represents a
part of the total, and the size of each segment corresponds to the proportion of that category.
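
A short sketch of a pie chart showing each customer's share of total sales. The values and the grouping are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to get an interactive window
import matplotlib.pyplot as plt
import pandas as pd

# Invented totals per customer
totals = pd.Series({"A": 500.0, "B": 250.0, "C": 250.0}, name="GROSS AMT")
shares = totals / totals.sum()   # fraction of the whole for each customer

fig, ax = plt.subplots()
# autopct prints the percentage inside each wedge
ax.pie(shares, labels=shares.index, autopct="%1.1f%%")
ax.set_title("Share of GROSS AMT by customer")
```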
5. Histogram:

A histogram is a type of bar chart used to represent the distribution of numerical data by dividing it into
bins (ranges) and showing the frequency of data points within each bin. It's mainly used for continuous data,
such as the distribution of age, temperature, or any other quantitative measure.
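
A minimal histogram sketch with matplotlib. The values and the choice of five bins are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to get an interactive window
import matplotlib.pyplot as plt

# Invented numeric values, e.g. quantities per transaction
values = [5, 7, 8, 12, 13, 14, 15, 21, 22, 30]

fig, ax = plt.subplots()
# hist() splits the value range into equal-width bins and counts values per bin
counts, bin_edges, _ = ax.hist(values, bins=5, edgecolor="black")
ax.set_xlabel("PCS")
ax.set_ylabel("Frequency")

print("Counts per bin:", counts)
print("Bin edges:", bin_edges)
```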
6. Box Plot:

A box plot (also known as a box-and-whisker plot) is a graphical representation used to summarize the
distribution of a dataset. It highlights key statistical properties, such as the median, quartiles, and potential
outliers. It's a great tool for understanding the spread and symmetry of data and for comparing multiple
distributions.
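
A minimal box plot sketch on invented values. With matplotlib's default settings, points further than 1.5 times the interquartile range from the box are drawn as outliers ("fliers").

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to get an interactive window
import matplotlib.pyplot as plt

# Invented values with one extreme point
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

fig, ax = plt.subplots()
result = ax.boxplot(values)   # returns a dict of the drawn artists
ax.set_ylabel("PCS")

median = result["medians"][0].get_ydata()[0]      # height of the median line
fliers = list(result["fliers"][0].get_ydata())    # points drawn as outliers

print("Median:", median)
print("Outliers:", fliers)
```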
7. Line Chart:

A line chart (or line graph) is a type of chart used to visualize data points in a time series or sequential
order, with the points connected by straight lines. It's a powerful tool for showing trends, patterns, and
changes over time, particularly when the data has a continuous flow.
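
A short sketch of a line chart of daily sales totals. The dates and amounts are invented, and ISO-formatted dates are used here so that parsing is unambiguous; the real file's date format may differ.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to get an interactive window
import matplotlib.pyplot as plt
import pandas as pd

# Invented transactions
df = pd.DataFrame({
    "DATE": ["2021-06-05", "2021-06-05", "2021-06-06", "2021-06-07", "2021-06-07"],
    "GROSS AMT": [100.0, 50.0, 200.0, 80.0, 120.0],
})
df["DATE"] = pd.to_datetime(df["DATE"])   # strings -> datetime for a proper time axis

daily = df.groupby("DATE")["GROSS AMT"].sum()   # total per day, sorted by date

fig, ax = plt.subplots()
ax.plot(daily.index, daily.values, marker="o")
ax.set_xlabel("DATE")
ax.set_ylabel("Total GROSS AMT")
fig.autofmt_xdate()   # tilt the date labels so they do not overlap
```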

8. Pair Plot (Multivariate visualization):

A pair plot (also known as a scatterplot matrix) is a powerful multivariate visualization tool that helps to
explore relationships between multiple variables in a dataset. It provides a matrix of scatterplots, with
each plot showing the relationship between a pair of variables.
