ML Report
csv
URL: https://www.kaggle.com/datasets/thedevastator/unlock-profits-with-e-commerce-sales-data?select=International+sale+Report.csv
Data Preprocessing:
1. Importing Required Libraries:
• pandas: Used for data loading, cleaning, and manipulation in tabular format.
• NumPy: Provides support for efficient numerical and array-based operations.
• scipy.stats: Offers statistical tools like zscore for normalization and gmean for geometric mean.
• matplotlib.pyplot: Enables creation of basic data visualizations like line and bar charts.
• seaborn: Simplifies the creation of visually appealing and informative statistical plots.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore, gmean
Output: Displays the total number of null values present in the dataset
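A null-value count like the one described above could be produced along these lines. This is only a sketch: the DataFrame and its column names (quantity, amount) are hypothetical stand-ins for the loaded sales data, not the report's actual columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded sales data; column names are illustrative.
df = pd.DataFrame({
    "quantity": [1, 2, np.nan, 4],
    "amount": [10.0, np.nan, np.nan, 40.0],
})

null_counts = df.isnull().sum()        # number of nulls in each column
total_nulls = int(null_counts.sum())   # number of nulls in the whole frame

print(null_counts)
print("Total nulls:", total_nulls)
```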
3. Statistical Measurements:
• Mean: The average value of the selected column, which gives a general idea of the central tendency of the
data.
• Median: The middle value when the data is sorted, providing a better measure of central tendency when
the data has outliers.
• Mode: The most frequently occurring value in the column, indicating the most common rating given by
customers.
• Standard Deviation: Shows how spread out the values are around the mean; a higher standard deviation
indicates more variability in customer ratings.
• Min and Max: Represent the lowest and highest values in the column, helping to understand the range of
customer feedback.
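As a rough illustration, the measures listed above can be computed with pandas as follows. The ratings values are made up for the example, and the name ratings is an assumption rather than a column taken from the dataset.

```python
import pandas as pd

# Illustrative ratings; not taken from the dataset.
ratings = pd.Series([4, 5, 3, 4, 4, 2, 5], name="rating")

summary = {
    "mean": ratings.mean(),
    "median": ratings.median(),
    "mode": ratings.mode().iloc[0],  # mode() can return several values; take the first
    "std": ratings.std(),            # sample standard deviation (ddof=1)
    "min": ratings.min(),
    "max": ratings.max(),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```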
4. Z-Score Calculation:
A Z-score measures how many standard deviations a data point lies from the mean. It helps in identifying
outliers in the data: values with a Z-score above 3 or below -3 are typically considered unusual.
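A minimal sketch of the calculation, using scipy.stats.zscore on made-up values and checking it against the formula z = (x - mean) / std. The array contents are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import zscore

# Illustrative values; not taken from the dataset.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 14.0])

z = zscore(values)  # uses the population standard deviation (ddof=0) by default

# The same result computed by hand from the definition.
manual = (values - values.mean()) / values.std()

print(np.round(z, 3))
```

After standardization the scores have mean 0 and standard deviation 1, which is what makes a fixed cutoff such as 3 meaningful across columns with different units.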
5. Outlier Detection using Z-score:
Outliers are data points that deviate significantly from other observations, lying far from the
mean. They can arise from variability in the data, measurement errors, or unusual conditions.
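One way to flag such points, sketched under the assumption that a cutoff of |z| > 3 is used. The values below are invented, with a single extreme entry added on purpose.

```python
import numpy as np
from scipy.stats import zscore

# Illustrative values with one deliberately extreme entry (500).
values = np.array([50, 52, 49, 51, 48, 50, 53, 47, 50, 51,
                   49, 52, 50, 48, 51, 50, 49, 52, 50, 500], dtype=float)

z = zscore(values)
threshold = 3.0                       # assumed cutoff
outlier_mask = np.abs(z) > threshold  # True where a point is flagged
outliers = values[outlier_mask]
cleaned = values[~outlier_mask]

print("Outliers:", outliers)
```

With very small samples a single extreme value inflates the standard deviation so much that its own Z-score may never reach 3, so the cutoff works best on reasonably large columns.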
6. Geometric Mean:
The geometric mean is a type of average that is especially useful for datasets with values that are
multiplicative or skewed, such as rates or percentages.
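For illustration, scipy.stats.gmean can be compared with the definition (the exponential of the mean of the logs). The growth factors below are assumed values, not figures from the dataset.

```python
import numpy as np
from scipy.stats import gmean

# Illustrative growth factors (+10%, +20%, -10%); all values must be positive.
growth_factors = np.array([1.10, 1.20, 0.90])

g = gmean(growth_factors)

# Equivalent definition: exp of the arithmetic mean of the logarithms.
manual = np.exp(np.log(growth_factors).mean())

print("Geometric mean:", g)
print("Arithmetic mean:", growth_factors.mean())
```

The geometric mean is never larger than the arithmetic mean, which is why it gives a more conservative average for compounding rates.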
Data Visualization:
1. Scatter Plot:
A scatter plot is a type of data visualization used to show the relationship between two numerical variables.
Each point on the scatter plot represents an individual data entry.
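A scatter plot of this kind might be drawn as in the sketch below. The data is synthetic, and the variable names quantity and amount are assumptions chosen for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: amount grows roughly linearly with quantity, plus noise.
rng = np.random.default_rng(0)
quantity = rng.integers(1, 20, size=50)
amount = quantity * 15.0 + rng.normal(0, 10, size=50)

fig, ax = plt.subplots()
points = ax.scatter(quantity, amount, alpha=0.7)  # one marker per data entry
ax.set_xlabel("quantity")
ax.set_ylabel("amount")
ax.set_title("Amount vs. quantity (synthetic data)")
# Call plt.show() in an interactive session to display the figure.
```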
The Pearson correlation coefficient is a statistical measure that indicates the strength and direction of the
linear relationship between two numerical variables. It ranges from -1 to +1.
The covariance matrix is a square matrix that displays the covariance values between pairs of variables in a
dataset. It helps to understand how two variables change together. A positive covariance indicates that the
variables tend to increase or decrease together, while a negative covariance indicates that one tends to
increase as the other decreases.
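Both quantities can be obtained directly from a DataFrame, as in this sketch. The two columns and their values are hypothetical.

```python
import pandas as pd

# Illustrative data; column names are assumptions.
df = pd.DataFrame({
    "quantity": [1, 2, 3, 4, 5],
    "amount": [12.0, 19.0, 33.0, 41.0, 48.0],
})

r = df["quantity"].corr(df["amount"])  # Pearson correlation is the default method
cov_matrix = df.cov()                  # sample covariance (ddof=1)

print("Pearson r:", r)
print(cov_matrix)
```

The diagonal of the covariance matrix holds each variable's variance, and the off-diagonal entries hold the pairwise covariances.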
2. Heatmap:
A heatmap is a data visualization technique that uses color to represent the intensity or frequency of values
in a matrix or 2D grid. The most common use cases include:
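A correlation matrix is one case where a heatmap is commonly applied. The sketch below assumes seaborn's heatmap function and uses random data with hypothetical column names.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random illustrative data; column names are assumptions.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["quantity", "rate", "amount"])

corr = df.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots()
# Fix the color scale to [-1, 1] so colors are comparable between plots.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap (synthetic data)")
```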
3. Bar Plot:
A bar plot is a graphical representation of data where rectangular bars represent the data values. The
length of each bar corresponds to the magnitude or frequency of the variable being represented. It's
commonly used for comparing different categories or tracking changes over time.
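A minimal bar plot sketch follows. The category names and totals are invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative totals per category; names and numbers are assumptions.
sales = pd.Series({"Shirts": 120, "Dresses": 95, "Kurtas": 150})

fig, ax = plt.subplots()
bars = ax.bar(sales.index, sales.values)  # bar height equals the category total
ax.set_xlabel("category")
ax.set_ylabel("units sold")
ax.set_title("Units sold per category (illustrative)")
```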
4. Pie Chart:
A pie chart is used to show the proportion of categories within a whole. Each segment of the pie represents a
part of the total, and the size of each segment corresponds to the proportion of that category.
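The following sketch draws a pie chart from assumed category shares. The channel names and percentages are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative shares of a whole; names and numbers are assumptions.
shares = pd.Series({"Online": 55, "Retail": 30, "Wholesale": 15})

fig, ax = plt.subplots()
# autopct prints each segment's percentage of the total inside the wedge.
ax.pie(shares.values, labels=list(shares.index), autopct="%1.1f%%")
ax.axis("equal")  # keep the pie circular
ax.set_title("Share of sales by channel (illustrative)")
```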
5. Histogram:
A histogram is a type of bar chart used to represent the distribution of numerical data by dividing it into
bins (ranges) and showing the frequency of data points within each bin. It's mainly used for continuous data,
such as the distribution of age, temperature, or any other quantitative measure.
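A histogram could be produced as sketched here, using synthetic, normally distributed values. The bin count of 10 is an arbitrary choice for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic continuous data.
rng = np.random.default_rng(2)
amounts = rng.normal(loc=500, scale=100, size=200)

fig, ax = plt.subplots()
# hist returns the count per bin and the bin edges (one more edge than bins).
counts, bin_edges, _ = ax.hist(amounts, bins=10, edgecolor="black")
ax.set_xlabel("amount")
ax.set_ylabel("frequency")
ax.set_title("Distribution of amount (synthetic data)")
```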
6. Box Plot:
A box plot (also known as a box-and-whisker plot) is a graphical representation used to summarize the
distribution of a dataset. It highlights key statistical properties, such as the median, quartiles, and potential
outliers. It's a great tool for understanding the spread and symmetry of data and for comparing multiple
distributions.
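The sketch below computes the quartiles by hand and then draws the box plot. The values are invented, with one entry (45) placed well above the rest so that it appears as an outlier under matplotlib's default 1.5 x IQR whisker rule.

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative values with one deliberately high entry (45).
values = np.array([12, 15, 14, 11, 13, 16, 14, 15, 13, 45], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # points above this are drawn as outliers

fig, ax = plt.subplots()
box = ax.boxplot(values)      # returns a dict of the drawn artists
ax.set_ylabel("value")
ax.set_title("Box plot (illustrative data)")

flier_values = box["fliers"][0].get_ydata()  # the points drawn as outliers
print("Median:", median, "IQR:", iqr, "Outliers:", flier_values)
```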
7. Line Chart:
A line chart (or line graph) is a type of chart used to visualize data points in a time series or sequential
order, with the points connected by straight lines. It's a powerful tool for showing trends, patterns, and
changes over time, particularly when the data has a continuous flow.
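A line chart over a time index might look like the sketch below. The monthly totals and dates are assumptions made up for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative monthly totals indexed by month start; values are assumptions.
monthly = pd.Series(
    [200, 240, 210, 300, 320, 310],
    index=pd.date_range("2022-01-01", periods=6, freq="MS"),
)

fig, ax = plt.subplots()
(line,) = ax.plot(monthly.index, monthly.values, marker="o")
ax.set_xlabel("month")
ax.set_ylabel("total sales")
ax.set_title("Monthly sales trend (illustrative)")
fig.autofmt_xdate()  # rotate date labels so they do not overlap
```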
8. Pair Plot:
A pair plot (also known as a scatterplot matrix) is a powerful multivariate visualization tool that helps to
explore relationships between multiple variables in a dataset. It provides a matrix of scatterplots, with
each plot showing the relationship between a pair of variables.