ml report

The dataset 'International Sale Report.csv' contains details of international sales transactions, including attributes such as date, customer, SKU, quantity, rate, and gross amount, and is intended for exploratory data analysis. It is sourced from Kaggle and is structured for use with libraries like pandas, numpy, and seaborn for data manipulation and visualization. The document outlines data preprocessing steps, statistical measurements, outlier detection, and various visualization techniques to analyze sales trends and customer behavior.

Uploaded by

mohammedraiyanrs

Dataset Name: International sale Report.csv

About the Dataset:


The dataset contains details of international sales transactions, including attributes like DATE, CUSTOMER,
SKU, PCS (quantity), RATE, and GROSS AMT (total sale value). This dataset is valuable for understanding sales
trends, detecting anomalies, and analyzing customer behavior.
The dataset has no defined target variable, so it suits unsupervised, exploratory work: it is used for
exploratory data analysis and visualization rather than prediction.

Source of the Dataset:


The dataset was obtained from Kaggle, a free online platform owned by Google where people can find datasets,
build machine learning models, and share data science projects. It is widely used by beginners and
professionals to learn, practice, and compete in data-related challenges. The file International sale Report.csv
is a structured (tabular) dataset of international sales records.

URL: https://www.kaggle.com/datasets/thedevastator/unlock-profits-with-e-commerce-sales-data?select=International+sale+Report.csv

Data Preprocessing:
1. Importing Required Libraries:

• pandas: Used for data loading, cleaning, and manipulation in tabular format.
• numpy: Provides support for efficient numerical and array-based operations.
• scipy.stats: Offers statistical tools like zscore for normalization and gmean for the geometric mean.
• matplotlib.pyplot: Enables creation of basic data visualizations like line and bar charts.
• seaborn: Simplifies the creation of visually appealing and informative statistical plots.

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore, gmean

2. Loading the Dataset:

Load the dataset into a pandas DataFrame and preview the first few rows.
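
The loading code itself is not preserved in this copy of the report. A minimal sketch of the step follows; the sample rows are made up for illustration (only the column names come from the dataset description), and the commented `read_csv` call assumes the CSV sits in the working directory.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'International sale Report.csv'.
# The rows below are invented; only the column names follow the report.
SAMPLE_CSV = """DATE,CUSTOMER,SKU,PCS,RATE,GROSS AMT
06-05-21,CUSTOMER A,SKU-001,2,100.0,200.0
06-05-21,CUSTOMER B,SKU-002,1,250.0,250.0
07-05-21,CUSTOMER A,SKU-003,4,75.0,300.0
"""

# With the real file this would be (path is an assumption):
#   df = pd.read_csv("International sale Report.csv")
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Preview the first few rows
print(df.head())
```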
output:

1. Dataset Dimensions Overview & Displaying Attribute Names:


• The code uses the .shape attribute of the DataFrame, which returns a tuple of the number of
rows and columns in the dataset, giving a quick view of its overall size.
• The attributes (column names) of the CSV file are displayed using the pandas.DataFrame.columns
property.
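
As a rough illustration of both points, the sketch below builds a small DataFrame with invented rows (the column names follow the dataset description) and prints its dimensions and attribute names.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the report
df = pd.DataFrame({
    "DATE": ["06-05-21", "06-05-21", "07-05-21"],
    "CUSTOMER": ["CUSTOMER A", "CUSTOMER B", "CUSTOMER A"],
    "SKU": ["SKU-001", "SKU-002", "SKU-003"],
    "PCS": [2, 1, 4],
    "RATE": [100.0, 250.0, 75.0],
    "GROSS AMT": [200.0, 250.0, 300.0],
})

# .shape is a (rows, columns) tuple
n_rows, n_cols = df.shape
print(f"Rows: {n_rows}, Columns: {n_cols}")

# .columns holds the attribute (column) names
column_names = df.columns.tolist()
print("Attributes:", column_names)
```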

output:

2. Removing Duplicate Entries & Handling Missing Values:


Duplicate rows are removed from the dataset to maintain data accuracy and avoid redundancy in analysis.
To ensure clean and reliable data, null values in the dataset are identified using isnull().sum(). This prevents
errors in further analysis and keeps the computed statistics reliable.
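
A minimal sketch of these two checks on invented data is shown below. Dropping rows with nulls at the end is only one possible way to handle them and is an assumption here; the report itself only states that nulls are identified.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: row 1 duplicates row 0, and the last row has a missing PCS
df = pd.DataFrame({
    "CUSTOMER": ["A", "A", "B", "C"],
    "SKU": ["S1", "S1", "S2", "S3"],
    "PCS": [2, 2, 1, np.nan],
    "RATE": [100.0, 100.0, 250.0, 75.0],
})

# Count and remove exact duplicate rows
n_duplicates = int(df.duplicated().sum())
print("Duplicate rows:", n_duplicates)
df = df.drop_duplicates().reset_index(drop=True)

# Count null values per column
null_counts = df.isnull().sum()
print(null_counts)

# Assumption: drop rows that still contain nulls (filling them is an alternative)
df_clean = df.dropna()
```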

Output: Displays the total number of null values present in the dataset

3. Statistical Measurements:

• Mean: The average value of the selected column, which gives a general idea of the central tendency of the
data.

• Median: The middle value when the data is sorted, providing a better measure of central tendency when
the data has outliers.

• Mode: The most frequently occurring value in the column, indicating the most common value among the
sales records (for example, the most frequent quantity).
• Standard Deviation: Shows how spread out the values are around the mean; a higher standard deviation
indicates more variability in the selected column.

• Min and Max: Represent the lowest and highest values in the column, helping to understand the range of
the data.
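
The measures above can be sketched as follows. The values are invented and the names `col_data` and `column` are placeholders for whichever numeric column is selected. Note that pandas' `std()` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical quantities standing in for the selected numeric column
col_data = pd.Series([2, 4, 4, 5, 10], name="PCS")
column = col_data.name

mean_val = col_data.mean()
median_val = col_data.median()
mode_val = col_data.mode().iloc[0]   # first mode if several values tie
std_val = col_data.std()             # sample standard deviation (ddof=1)
min_val, max_val = col_data.min(), col_data.max()

print(f"Statistics for '{column}':")
print(f"Mean: {mean_val:.2f}")
print(f"Median: {median_val:.2f}")
print(f"Mode: {mode_val}")
print(f"Standard Deviation: {std_val:.2f}")
print(f"Min: {min_val}, Max: {max_val}")
```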

output:

4. Z-Score Calculation:

Z-score measures how many standard deviations a data point is from the mean. It helps in identifying
outliers in the data — values with a Z-score above 3 or below -3 are typically considered unusual.
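
In formula terms, z = (x - mean) / standard deviation. The sketch below, on invented values, compares `scipy.stats.zscore` with the manual formula. Both use the population standard deviation (`ddof=0`), which is the default for `zscore` and for NumPy's `std()`.

```python
import numpy as np
from scipy.stats import zscore

# Hypothetical values; mean is 5 and population standard deviation is 2
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Library computation (ddof=0 by default)
z_scores = zscore(values)

# Manual computation of the same quantity
manual_z = (values - values.mean()) / values.std()

print(z_scores)
```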

output:
5. Outlier Detection using Z-score:
Outliers are data points that deviate significantly from other observations, that is, they lie far from the
mean. They can arise due to variability in data, measurement errors, or unusual conditions.
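
A minimal sketch of Z-score based detection follows, assuming the common cutoff of 3 in absolute value. The quantities are invented, with one deliberately extreme entry. With very few rows no point can reach |z| > 3, so the rule needs a reasonably large sample.

```python
import pandas as pd
from scipy.stats import zscore

# Hypothetical quantities: 19 ordinary orders and one extreme order of 500
pcs = [10, 12, 11, 9, 10, 13, 12, 11, 10, 9,
       11, 12, 10, 11, 9, 10, 12, 11, 10, 500]
df = pd.DataFrame({"PCS": pcs})

# Z-score of every row for the chosen column
df["z_score"] = zscore(df["PCS"].to_numpy())

# Rows more than THRESHOLD standard deviations away from the mean
THRESHOLD = 3  # assumed cutoff
outliers = df[df["z_score"].abs() > THRESHOLD]

print(f"Number of outliers: {len(outliers)}")
print(outliers)
```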

6. Geometric Mean:
The geometric mean is a type of average that is especially useful for datasets with values that are
multiplicative or skewed, such as rates or percentages.

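# Geometric Mean (only for positive values)
# col_data (the selected numeric column) and column (its name) are assumed to be defined in the earlier steps
if (col_data > 0).all():
    print(f"\nGeometric Mean of '{column}': {gmean(col_data):.2f}")
else:
    print("\nGeometric Mean: Not applicable (contains zero or negative values)")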
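
To see how the geometric mean behaves on skewed data, here is a small self-contained sketch with invented values. `col_data` is defined locally here, whereas the snippet above takes it from earlier steps.

```python
import pandas as pd
from scipy.stats import gmean

# Hypothetical, strongly skewed positive values
col_data = pd.Series([1.0, 10.0, 100.0], name="RATE")

arithmetic_mean = col_data.mean()
geo_mean = None
if (col_data > 0).all():
    # gmean is the n-th root of the product of the n values
    geo_mean = gmean(col_data.to_numpy())

print(f"Arithmetic mean: {arithmetic_mean:.2f}")
print(f"Geometric mean: {geo_mean:.2f}")
```

The arithmetic mean is pulled toward the largest value, while the geometric mean stays at the multiplicative midpoint of the three values.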

output:

Data Visualization:
1. Scatter Plot:

A scatter plot is a type of data visualization used to show the relationship between two numerical variables.
Each point on the scatter plot represents an individual data entry.
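
The plotting code is not preserved in this copy. A minimal sketch is given below; it assumes quantity (PCS) is plotted against gross amount, which is a choice made here rather than one confirmed by the report, and it uses invented values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values; column names follow the report
df = pd.DataFrame({
    "PCS": [1, 2, 3, 4, 5, 6],
    "GROSS AMT": [100.0, 210.0, 290.0, 405.0, 480.0, 620.0],
})

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df["PCS"], df["GROSS AMT"], alpha=0.7)  # one point per row
ax.set_xlabel("PCS (quantity)")
ax.set_ylabel("GROSS AMT")
ax.set_title("Quantity vs. gross amount")
fig.tight_layout()
# plt.show() displays the figure in an interactive session
```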

1. Pearson Correlation Coefficient & Covariance Matrix:

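The Pearson correlation coefficient is a statistical measure that indicates the strength and direction of the
linear relationship between two numerical variables. It ranges from -1 to +1: values near +1 indicate a strong
positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear
relationship.

The covariance matrix is a square matrix that displays the covariance values between pairs of variables in a
dataset. It helps to understand how two variables change together. A positive covariance indicates that the
variables tend to increase or decrease together, while a negative covariance indicates that one tends to
increase as the other decreases.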
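
Both matrices can be obtained directly from a pandas DataFrame. The sketch below uses invented values for three numeric columns. `DataFrame.cov()` returns sample covariances (normalized by n - 1).

```python
import pandas as pd

# Hypothetical values: the rate falls as the quantity rises
df = pd.DataFrame({
    "PCS": [1, 2, 3, 4, 5],
    "RATE": [120.0, 110.0, 100.0, 95.0, 90.0],
})
df["GROSS AMT"] = df["PCS"] * df["RATE"]

# Pearson correlation between every pair of numeric columns
corr_matrix = df.corr(method="pearson")

# Sample covariance between every pair of numeric columns
cov_matrix = df.cov()

print(corr_matrix.round(2))
print(cov_matrix.round(2))
```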
2. Heatmap:

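A heatmap is a data visualization technique that uses color to represent the intensity or frequency of values
in a matrix or 2D grid. A common use case is displaying a correlation matrix, where each cell's color encodes
the strength of the relationship between two variables.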
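
As one possible illustration, the sketch below draws a seaborn heatmap of a correlation matrix. The data is invented, and the use of a correlation matrix as input is an assumption, since the original plotting code is not preserved.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical numeric columns
df = pd.DataFrame({
    "PCS": [1, 2, 3, 4, 5],
    "RATE": [120.0, 110.0, 100.0, 95.0, 90.0],
})
df["GROSS AMT"] = df["PCS"] * df["RATE"]

corr_matrix = df.corr()

fig, ax = plt.subplots(figsize=(5, 4))
# annot=True writes each coefficient into its cell; vmin/vmax fix the color scale
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap")
fig.tight_layout()
```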

3. Bar Plot:
A bar plot is a graphical representation of data where rectangular bars represent the data values. The
length of each bar corresponds to the magnitude or frequency of the variable being represented. It's
commonly used for comparing different categories or tracking changes over time.
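
A minimal sketch of a bar plot follows, assuming the comparison of interest is total gross amount per customer. Both the grouping and the rows are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sales rows
df = pd.DataFrame({
    "CUSTOMER": ["A", "B", "A", "C"],
    "GROSS AMT": [200.0, 250.0, 300.0, 150.0],
})

# Total sales value per customer, largest first
totals = df.groupby("CUSTOMER")["GROSS AMT"].sum().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(totals.index.tolist(), totals.values)  # one bar per customer
ax.set_xlabel("CUSTOMER")
ax.set_ylabel("Total GROSS AMT")
ax.set_title("Total sales by customer")
fig.tight_layout()
```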
4. Pie Chart:
A pie chart is used to show the proportion of categories within a whole. Each segment of the pie represents a
part of the total, and the size of each segment corresponds to the proportion of that category.
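
The idea can be sketched as below, assuming the whole is total gross amount and the categories are customers. Both the choice and the rows are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sales rows
df = pd.DataFrame({
    "CUSTOMER": ["A", "B", "A", "C"],
    "GROSS AMT": [200.0, 250.0, 300.0, 250.0],
})

# Each customer's total; the pie shows each total as a share of the sum
shares = df.groupby("CUSTOMER")["GROSS AMT"].sum()

fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(shares.values, labels=shares.index.tolist(),
       autopct="%1.1f%%", startangle=90)
ax.axis("equal")  # keep the pie circular
ax.set_title("Share of gross amount by customer")
```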
5. Histogram:

A histogram is a type of bar chart used to represent the distribution of numerical data by dividing it into
bins (ranges) and showing the frequency of data points within each bin. It's mainly used for continuous data,
such as the distribution of age, temperature, or any other quantitative measure.
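
A minimal sketch with invented gross amounts is shown below. The number of bins (5) is an arbitrary choice made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical gross amounts with one large sale
gross = pd.Series([100, 150, 200, 220, 250, 260, 300, 320, 400, 900],
                  name="GROSS AMT", dtype=float)

fig, ax = plt.subplots(figsize=(6, 4))
# hist returns the count per bin and the bin edges (bins + 1 values)
counts, bin_edges, _ = ax.hist(gross, bins=5, edgecolor="black")
ax.set_xlabel("GROSS AMT")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of gross amount")
fig.tight_layout()

print("Counts per bin:", counts)
```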
6. Box Plot:

A box plot (also known as a box-and-whisker plot) is a graphical representation used to summarize the
distribution of a dataset. It highlights key statistical properties, such as the median, quartiles, and potential
outliers. It's a great tool for understanding the spread and symmetry of data and for comparing multiple
distributions.
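
The sketch below draws a box plot of invented quantities with matplotlib. Points beyond 1.5 times the interquartile range from the box (matplotlib's default whisker rule) are drawn as individual fliers.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical quantities: 19 ordinary orders and one extreme order of 500
pcs = [10, 12, 11, 9, 10, 13, 12, 11, 10, 9,
       11, 12, 10, 11, 9, 10, 12, 11, 10, 500]

fig, ax = plt.subplots(figsize=(4, 5))
# boxplot returns a dict of the drawn artists (boxes, medians, whiskers, fliers)
bp = ax.boxplot(pcs)
ax.set_ylabel("PCS (quantity)")
ax.set_title("Box plot of quantity")
fig.tight_layout()

flier_values = bp["fliers"][0].get_ydata()
print("Points drawn as outliers:", flier_values)
```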
7. Line Chart:

A line chart (or line graph) is a type of chart used to visualize data points in a time series or sequential
order, with the points connected by straight lines. It's a powerful tool for showing trends, patterns, and
changes over time, particularly when the data has a continuous flow.
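
A minimal sketch follows, assuming the trend of interest is total gross amount per day. The dates and values are invented, and the date format of the real file may differ.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sales rows, several per day
df = pd.DataFrame({
    "DATE": ["2021-05-06", "2021-05-06", "2021-05-07", "2021-05-08", "2021-05-08"],
    "GROSS AMT": [200.0, 250.0, 300.0, 150.0, 100.0],
})

# Parse the dates so they sort and plot chronologically
df["DATE"] = pd.to_datetime(df["DATE"])

# Total sales value per day
daily = df.groupby("DATE")["GROSS AMT"].sum().sort_index()

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(daily.index, daily.values, marker="o")
ax.set_xlabel("DATE")
ax.set_ylabel("Total GROSS AMT")
ax.set_title("Daily sales trend")
fig.autofmt_xdate()  # rotate date labels so they do not overlap
```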

8. Pair Plot (Multivariate visualization):

A pair plot (also known as a scatterplot matrix) is a powerful multivariate visualization tool that helps to
explore relationships between multiple variables in a dataset. It provides a matrix of scatterplots, with
each plot showing the relationship between a pair of variables.
