
computer project

The project presents a data analysis report on an e-commerce customer dataset, focusing on purchasing behavior and business insights. It includes system study, data cleaning, exploratory analysis, and highlights weak correlations among variables, with some positive trends for age and membership years on spending. The findings aim to inform marketing, customer relationship management, and sales strategies.

Uploaded by

sarah asif
Introduction To Computer Application

Fall 2025

Term Project

Data Analysis Report

Submitted to:
Dr. Syed Irfan Nabi

Submitted by:

Syed Raaid Rizvi (32324)


Murtaqa Abbas (32993)
Sarah Asif (33076)

Table of contents
Chapter 1: System Study and Domain Analysis
1.1- Business Process and Analytics Applications
1.2- Data Composition

Chapter 2: Data Cleaning
2.1- Data Type Analysis and Conversion
2.2- Handling Missing Values

Chapter 3: Exploratory Data Analysis
3.1- Univariate Analysis
3.1.1- Numeric Values
3.1.2- Categorical Values
3.2- Bivariate Analysis
3.2.1- Heatmap of numerical data
3.2.2- Count plots of non-numerical groups against spending score
3.2.3- Regression plots of numerical data against spending score

Chapter 4: Summary

Chapter 1: System Study and Domain Analysis
This chapter sets the groundwork for our analysis by examining the domain of online customer
management, the business context in which our dataset exists. The dataset represents a
small but meaningful slice of an e-commerce operation, capturing key customer details such as
ID, age, gender, income, spending score, membership years, purchase frequency, preferred
category, and last purchase amount. We will explore the types of analytics that can be
performed on this dataset, such as summarizing customer demographics, analyzing customer
purchasing patterns, segmenting customers, and forecasting high-value purchases. Finally, this
chapter provides a detailed overview of the dataset's structure, highlighting the main
attributes, their data types, and their relevance to business analysis.

1.1- Business Process:


This dataset appears to represent customer behavior and purchase patterns for a retail or
e-commerce business. The main business processes involved are customer relationship management
(CRM) and sales optimization, which include customer profiling, purchase behavior tracking,
customer segmentation, marketing and promotion, and sales forecasting and inventory planning.

Types Of Analytics:
The types of analytics that can be performed on this dataset include:
1. Descriptive Analytics: It helps in summarizing the customer demographics, spending
habits, and purchasing patterns, in the form of a graph or diagram, to help the business
understand its current customer base. It provides clear insights into who the customers
are, what they buy, and how they behave.
2. Segmentation Analysis: It groups customers into distinct clusters based on their
spending levels, purchase frequency, and product category preferences. This helps the
business target each segment with more personalized marketing, offers, and services.
3. Predictive Analysis: Predictive analytics uses past customer behavior, such as spending
patterns, purchase frequency, and loyalty, to forecast who is likely to make high-value
purchases or stop buying.
4. Diagnostic Analysis: This type of analysis investigates the underlying reasons behind
customer behavior by examining patterns and anomalies in the data. It helps the
business understand why certain customers spend less, reduce their purchase
frequency, or shift their category preferences.

Analytics Applications and How They Are Useful for a Business:
Analyzing this dataset can give important insights for several key business functions:
1. Sales and Marketing Optimization: By understanding customer spending scores,
preferred product categories, and purchase frequency, businesses can design targeted
marketing campaigns and promote products that match customer interests.
2. Customer Segmentation: The dataset helps identify distinct customer groups based on
income, spending behavior, loyalty, and demographics, allowing the business to tailor its
products, marketing, and customer service to specific groups. This leads to increased
customer satisfaction, better resource allocation, and a higher return on
investment.
3. Customer Relationship Management (CRM): Insights from membership years, last
purchase amount, and spending patterns help identify loyal, high-value customers,
allowing businesses to offer personalized and efficient support. This improves
customer relations, which boosts sales and aids better decision-making.
4. Product and Category Insights: Analyzing the preferred product categories across
different demographic groups helps the business understand which categories are most
profitable and where to focus future promotions.
1.2- Data Composition
Attribute             Type                  Missing values  Feature importance
Id                    Integer (Numeric)     0               High
Age                   Float (Numeric)       6               Average
Gender                Object (String)       6               Low
Income                Float (Numeric)       6               High
Spending score        Float (Numeric)       11              High
Membership years      Float (Numeric)       5               Average
Purchase frequency    Float (Numeric)       7               Average
Preferred category    Object (Categorical)  9               Low
Last purchase amount  Float (Numeric)       7               High

This table helps us understand the importance of each attribute to the business. Income,
spending score, and last purchase amount are the most important, as they directly influence
revenue, customer value, and business strategy. Age, membership years, and purchase
frequency are useful for segmentation, loyalty analysis, and sales forecasting, but they do not
directly affect crucial business decisions. Gender and preferred category are mainly used for
personalization and targeted marketing; however, it is not reliable to depend on them for
accurately predicting a customer's behavior.
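
The type and missing-value columns of a table like the one above can be produced directly with pandas. The sketch below is only an illustration: the sample rows and the column names (id, age, gender, income) are assumptions, not the project's actual file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the column names and values are assumptions for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [25.0, np.nan, 41.0, 63.0],
    "gender": ["Male", "Female", None, "Female"],
    "income": [72000.0, 81500.0, np.nan, np.nan],
})

# One row per attribute: its pandas dtype and how many values are missing.
composition = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isna().sum(),
})
print(composition)
```

The feature importance column is a business judgment and has to be added by hand.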

Chapter 2: Data Cleaning

Data cleaning is an essential first step in preparing any dataset for analysis. Before meaningful
insights can be extracted, the data must be checked for incorrect data types, missing values, and
inconsistencies that can interfere with statistical procedures. In this section, we carefully inspect
the dataset, correct data types where necessary, and apply appropriate strategies to handle
missing values. This ensures that the dataset is accurate, consistent, and ready for reliable
analysis in later stages.

2.1- Data Type Analysis and Conversion


The dataset includes columns for ID, age, gender, income, spending score, membership years,
purchase frequency, preferred category, and last purchase amount. The originally assigned data
types were int64, float64, object, float64, float64, float64, float64, object, and float64
respectively. The data types assigned to ID, gender, preferred category, and last purchase
amount are appropriate. Age, income, and membership years should be whole numbers, so they
are changed to int64. All spending scores observed in the data are whole numbers. On that
basis we assume that the score is given on an integer scale rather than a continuous one.
Hence, the spending score data type is also changed from float64 to int64.
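
A minimal sketch of this conversion is shown below. The column names and sample values are assumptions. Note that pandas cannot cast a column containing NaN to int64, so the sketch assumes the missing values have already been filled.

```python
import pandas as pd

# Hypothetical sample; column names are assumed, not taken from the project file.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [25.0, 41.0, 63.0],
    "income": [72000.0, 81500.0, 99000.0],
    "spending_score": [40.0, 55.0, 71.0],
    "membership_years": [3.0, 5.0, 8.0],
})

whole_number_cols = ["age", "income", "spending_score", "membership_years"]

# astype("int64") raises an error on NaN, so check that no gaps remain first.
assert not df[whole_number_cols].isna().any().any()

# Cast only the listed columns; all other columns keep their original dtype.
df = df.astype({col: "int64" for col in whole_number_cols})
print(df.dtypes)
```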

2.2- Handling Missing Values


Our strategy for handling missing and invalid data was based on the discrepancies observed in
the dataset, with a general fix applied so that the data can still be used for analyzing
customer trends.
1. 'Age' and 'income' can be used to identify group trends in customers. We ensured both
are within an acceptable range to remove data entry faults like negative ages or
incomes. All invalid and null values were filled with the mean, as it would least
tamper with inferential analysis of the data.
2. For numeric data like 'spending score' and 'purchase frequency', which are required to
understand spending trends, we used the median because outliers would
distort the mean.
3. For null values in demographic data like 'gender' we filled all empty elements with
'Other' to avoid creating incorrect trends in the Male and Female categories. For 'preferred
category' we ensured there were no spelling discrepancies by checking the output of the
value_counts() function and manually replacing all errors. We also replaced null values
with 'Unknown', again to avoid inferential errors.
4. For customer-related information like 'membership years' and 'last purchase amount'
we applied a minimum limit of 1 to ensure all values entered were positive. All invalid
and null values were replaced with the mean, which was rounded to 2 decimal places for last
purchase amount and converted to an integer for membership years to maintain
consistency.
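
The four rules above can be sketched in pandas as follows. This is an illustration under stated assumptions, not the project's actual script: the column names, the sample rows, the helper `fill_invalid_with_mean`, the age bounds of 0 to 120, and the spelling map are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows containing the kinds of faults described above.
df = pd.DataFrame({
    "age": [25.0, -4.0, np.nan, 60.0],
    "income": [70000.0, np.nan, 82000.0, -100.0],
    "spending_score": [40.0, np.nan, 55.0, 90.0],
    "purchase_frequency": [12.0, 30.0, np.nan, 45.0],
    "gender": ["Male", None, "Female", "Female"],
    "preferred_category": ["Sports", "sports", None, "Clothing"],
    "membership_years": [3.0, 0.0, np.nan, 9.0],
    "last_purchase_amount": [120.5, np.nan, -5.0, 640.0],
})

def fill_invalid_with_mean(series, lower, upper=np.inf):
    """Treat out-of-range values as missing, then fill gaps with the mean of valid values."""
    valid = series.where(series.between(lower, upper))
    return valid.fillna(valid.mean())

# Rule 1: range check plus mean fill (the bounds are assumed).
df["age"] = fill_invalid_with_mean(df["age"], lower=0, upper=120)
df["income"] = fill_invalid_with_mean(df["income"], lower=0)

# Rule 2: median fill, which is less sensitive to outliers than the mean.
for col in ["spending_score", "purchase_frequency"]:
    df[col] = df[col].fillna(df[col].median())

# Rule 3: neutral labels for missing categories, after fixing spellings.
df["gender"] = df["gender"].fillna("Other")
df["preferred_category"] = (
    df["preferred_category"].replace({"sports": "Sports"}).fillna("Unknown")
)

# Rule 4: minimum limit of 1, mean fill, then rounding or integer conversion.
df["membership_years"] = (
    fill_invalid_with_mean(df["membership_years"], lower=1).round().astype("int64")
)
df["last_purchase_amount"] = fill_invalid_with_mean(
    df["last_purchase_amount"], lower=1
).round(2)

print(df)
```

Masking invalid entries before computing the mean keeps values such as a negative age from pulling the fill value down.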

Chapter 3: Exploratory Data Analysis
After data cleaning, we move on to Exploratory Data Analysis, an important step in
understanding the underlying structure and characteristics of a dataset. It involves
summarizing the main features using both numerical measures and visualizations, allowing
us to identify patterns, trends, and potential outliers.

3.1- Univariate Analysis:


This part focuses on univariate analysis, which helps us identify the shape of the data: whether
it is normal, skewed, or contains outliers. It highlights important characteristics such as mean,
median, mode, spread, and overall patterns. By analyzing one variable at a time using boxplots,
histograms, pie charts, and count plots, we can better understand the behavior of each
variable and arrive at a more effective analysis of the data.
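
The numerical side of a univariate check can be sketched as below: central tendency, skewness, and the 1.5 x IQR rule that boxplots conventionally use to flag outliers. The sample values are hypothetical and stand in for one numeric column.

```python
import pandas as pd

# Hypothetical values standing in for one numeric column.
income = pd.Series(
    [52000, 61000, 70000, 74000, 78000, 82000, 90000, 150000], name="income"
)

summary = {
    "mean": income.mean(),
    "median": income.median(),
    "skew": income.skew(),  # above 0 suggests a right tail, below 0 a left tail
}

# Boxplot-style outlier rule: points beyond 1.5 * IQR from the quartiles.
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]

print(summary)
print(outliers.tolist())
```

A mean well above the median, together with positive skewness, is the numerical counterpart of a right-skewed histogram.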

3.1.1- Numeric Values:


a) Income:
The income distribution is not normal; it is right-skewed because some customers have
extremely high incomes. The histogram shows that most customers have an income
between 70000 and 82000. The box-plot further shows that the median lies around
82000. There are also no outliers in this data set.

b) Spending score:
The spending score does not follow a normal distribution and is more evenly distributed;
however, it is slightly left-skewed, with the mean around 50.68.

c) Membership Years:
The histogram and box-plot show that the distribution is not normal and is slightly
right-skewed because many users have higher membership duration. The mean is 5.46, and the
range lies from 3 to 8.

d) Purchase Frequency:
There is a moderate spread, but most customers fall between 15 and 40 purchases. The
distribution is not normal, as there is a mild left-skew showing that customers return
slightly more frequently.

e) Last Purchase Amount:


Purchases go up to almost 1000 dollars, with the median lying slightly short of 500.
The interquartile range is from 190 to almost 800 and the graph is very slightly right-skewed.
The histogram shows that most customers' last purchases fall within the 0-100
region, suggesting that the store has many customers coming in for daily groceries and
similar items.

f) Age:
Age is fairly evenly spread; however, there is slight right-skewness as the upper age values (60-69)
stretch the tail. The interquartile range is from 30 to 55 and the median is around 42.

3.1.2- Categorical values:
a) Gender:
The data is relatively evenly distributed among the three groups, as demonstrated by the bar
graph and pie chart, with slightly more individuals falling into the "Other" category.

b) Preferred Category
The bar graph and pie chart show that sports is the most preferred category; however, the
distribution is fairly even, with clothing and home & garden having exactly the same count.
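
The counts behind a bar graph and the percentage shares behind a pie chart can both be obtained with value_counts(). The sketch below uses a hypothetical list of category labels, not the project's data.

```python
import pandas as pd

# Hypothetical category labels for illustration only.
category = pd.Series(
    ["Sports", "Clothing", "Sports", "Home & Garden",
     "Electronics", "Clothing", "Sports", "Home & Garden"],
    name="preferred_category",
)

# Absolute counts (bar graph) and percentage shares (pie chart).
counts = category.value_counts()
shares = category.value_counts(normalize=True).mul(100).round(1)

print(counts)
print(shares)
```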

3.2- Bivariate Analysis:
This part of the project explores the relationship between two variables at a time, allowing us to
move beyond simple descriptions and start understanding how different aspects of a dataset
interact. It helps reveal whether changes in one variable are associated with changes in another,
whether that relationship is positive, negative, strong, weak, or nonexistent. By using tools such
as heatmaps, regression plots and count plots we will be able to uncover trends, patterns, and
group differences that are not visible through univariate analysis alone.

3.2.1 Heatmap of numerical data:


Every intersection or box shows the correlation between two variables. Both axes are
identical, hence we can observe a diagonal dark red line. The dark red color shows a perfect
correlation, which occurs because each variable is compared with itself, for example age
against age. The scale on the side shows that, ranging from dark red to dark blue, the
correlations go from perfectly positive at dark red to moderately positive at white, show no
correlation at light blue, and continue to negative correlation at dark blue.

In our map almost all the boxes are light blue, showing that most relationships are very weak.
Linearly speaking, these variables are effectively independent of each other. So, we can say
that knowing a customer's income cannot help us predict their spending score.
The highest positive relationship is 0.19, between id and membership years. Because an id is
an arbitrary label, this value cannot be used for inferential predictions or analysis. No
meaningful insight can be derived from this relation.
Some relationships like age against last purchase amount and membership years vs spending
score are slightly positive at 0.16 and 0.15 respectively. This shows that, though the link is very
weak, there is a slight possibility that customers of greater age spend more, as well as
customers that have been members for a longer period of time.
The most prominent negative correlation stands at -0.14 comparing income and purchase
frequency. This actually shows a slight tendency that with an increased income customers will
purchase less frequently from the stores. The dataset has minimal predictive power as there are
no strong linear correlations.
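
The numbers read off a correlation heatmap can be reproduced as a table, which also makes it easy to rank variable pairs by strength. The sketch below is an illustration only: the column names and sample values are assumptions, and it uses the Pearson coefficient, which is the pandas default.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; values are made up for illustration.
df = pd.DataFrame({
    "age": [22, 35, 41, 50, 63, 29],
    "income": [60000, 72000, 81000, 79000, 90000, 65000],
    "spending_score": [30, 62, 45, 70, 55, 48],
    "gender": ["Male", "Female", "Other", "Female", "Male", "Female"],
})

# Pearson correlation between every pair of numeric columns.
corr = df.select_dtypes("number").corr()

# Keep only the upper triangle so each pair appears once and the
# diagonal of 1.0 values (a variable against itself) is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
```

Sorting by absolute value puts the strongest relationships first regardless of sign.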

3.2.2 Count plots of non-numerical groups against spending score
These plots show the distribution of spending score in different groups. Each bar shows a
distinct spending score value and the height of the bar shows how many customers share that
spending score. Light colors show low spending scores while darker colors show higher ones.

1. Gender:

Female: There is a notable spike in the mid-range scores. This suggests females are likely to
spend in or near that middle range.
Male: There are several spikes at the higher scores, indicating male customers are likely to spend
within the mid-high to high spending range.
Other: This category shows high variability and no real trend. It also
includes corrected data, so accurate inference is not likely.

2. Preferred category:

Electronics: The most distinct feature is in the Electronics category. There is a very tall, thin bar
in the lighter color range. This shows that a lot of customers who prefer electronics share the
same, relatively low spending score.
Groceries & Clothing: These categories show a messy distribution with many short bars. This
means customers who buy these items have widely varying spending scores and there is very
little similarity among them.
Sports: This category has a few notable bars in the low to mid-range, suggesting slight
consistency in the spending habits of such customers.
The count plot for these shows the high variability of scores reinforcing the conclusions from
the heatmap of very weak correlations between all variables.
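
The counts behind such plots can also be tabulated. The sketch below is a simplified variant under stated assumptions: instead of one bar per distinct score, it groups scores into three hypothetical bands (low, mid, high) so the table stays readable. The band edges and sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Other", "Female", "Male", "Female"],
    "spending_score": [48, 82, 15, 52, 77, 50],
})

# Assumed band edges: (0, 33] low, (33, 66] mid, (66, 100] high.
bands = pd.cut(
    df["spending_score"], bins=[0, 33, 66, 100], labels=["low", "mid", "high"]
)

# Rows are groups, columns are score bands, cells are customer counts.
table = pd.crosstab(df["gender"], bands)
print(table)
```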

3.2.3 Regression plots of numerical data against spending score

Overall, the plots depict nearly the same thing. The plots for income, purchase frequency, and
last purchase amount are similar. They all show a nearly horizontal blue line, indicating
that these variables have no noticeable effect on spending score. The dots are scattered
randomly, showing no clear pattern.

The only signs of a pattern emerging are apparent in age and membership years. Membership
years has the most noticeable upward sloping line which shows that long term members are
likely to spend more. The same applies for age where there is also a slight upward slope
indicating that older customers are slightly more likely to spend more.
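
The slope of a regression line can be checked numerically with an ordinary least-squares fit. The sketch below uses numpy.polyfit on hypothetical values. A slope near zero corresponds to a nearly horizontal line, and a positive slope to an upward trend.

```python
import numpy as np

# Hypothetical values for illustration only.
membership_years = np.array([1, 2, 3, 4, 5, 6, 7, 8])
spending_score = np.array([42, 47, 45, 52, 50, 56, 55, 61])

# Degree-1 least-squares fit: spending_score is approximated by
# slope * membership_years + intercept.
slope, intercept = np.polyfit(membership_years, spending_score, deg=1)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```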

Chapter 4: Summary
The project analyzes an e-commerce customer dataset to understand purchasing behavior and
inform business decisions. Data cleaning ensured accuracy, corrected data types, and handled
missing values. Univariate analysis revealed distributions and trends in numeric variables like
income, spending score, and membership years, while categorical variables showed preferences
and demographics. Bivariate analysis explored relationships between variables using heatmaps,
count plots, and regression plots. Overall, most variables showed weak correlations, with slight
positive trends for age and membership years on spending, indicating that older and long-term
customers tend to spend more. Insights can guide marketing, CRM, and sales strategies.
