computer project
Fall 2025
Term Project
Submitted to:
Dr. Syed Irfan Nabi
Submitted by:
Table of contents
Chapter 1: System Study and Domain Analysis
1.1-Business Process and Analytics Applications
1.2-Data Composition
Chapter 4: Summary
Chapter 1: System Study and Domain Analysis
This chapter sets the groundwork for our analysis by examining the domain of e-commerce customer
behavior, the business context in which our dataset exists. The dataset represents a small but
meaningful slice of an e-commerce operation, capturing key customer details such as age, gender,
income, spending score, membership years, purchase frequency, preferred product category, and
last purchase amount. We will explore the types of analytics that can be performed on this
dataset, such as summarizing the customer base, segmenting customers, analyzing purchasing
patterns, and forecasting customer value. Finally, this chapter provides a detailed overview of
the dataset’s structure, highlighting the main attributes, their data types, and their relevance
to business analysis.
Types Of Analytics:
The types of analytics that can be performed on this dataset include:
1. Descriptive Analytics: It helps in summarizing the customer demographics, spending
habits, and purchasing patterns, in the form of a graph or diagram, to help the business
understand its current customer base. It provides clear insights into who the customers
are, what they buy, and how they behave.
2. Segmentation Analysis: It groups customers into distinct clusters based on their
spending levels, purchase frequency, and product category preferences. This helps the
business target each segment with more personalized marketing, offers, and services.
3. Predictive Analysis: Predictive analytics uses past customer behavior, such as spending
patterns, purchase frequency, and loyalty, to forecast who is likely to make high-value
purchases or stop buying.
4. Diagnostic Analysis: This type of analysis investigates the underlying reasons behind
customer behavior by examining patterns and anomalies in the data. It helps the
business understand why certain customers spend less, reduce their purchase
frequency, or shift their category preferences.
Analytics Applications and How They Are Useful for a Business:
Analyzing this dataset can give important insights for several key business functions:
1. Sales and Marketing Optimization: By understanding customer spending scores,
preferred product categories, and purchase frequency, businesses can design targeted
marketing campaigns and promote products that match customer interests.
2. Customer Segmentation: The dataset helps identify distinct customer groups based on
income, spending behavior, loyalty, and demographics. This allows the business to tailor
its products, marketing, and customer service to specific groups, which leads to increased
customer satisfaction, better resource allocation, and a higher return on investment.
3. Customer Relationship Management (CRM): Insights from membership years, last
purchase amount, and spending patterns help identify loyal, high-value customers. This
allows businesses to offer personalized and efficient support, which improves customer
relations, boosts sales, and aids decision-making.
4. Product and Category Insights: Analyzing the preferred product categories across
different demographic groups helps the business understand which categories are most
profitable and where to focus future promotions.
1.2- Data Composition
Attribute             | Type                 | Missing values | Feature importance
Id                    | Integer (Numeric)    | 0              | High
Age                   | Float (Numeric)      | 6              | Average
Gender                | Object (String)      | 6              | Low
Income                | Float (Numeric)      | 6              | High
Spending score        | Float (Numeric)      | 11             | High
Membership years      | Float (Numeric)      | 5              | Average
Purchase frequency    | Float (Numeric)      | 7              | Average
Preferred category    | Object (Categorical) | 9              | Low
Last purchase amount  | Float (Numeric)      | 7              | High
This table helps us understand the importance of each attribute to the business. Income,
spending score, and last purchase amount are the most important, as they directly influence
revenue, customer value, and business strategy. Age, membership years, and purchase
frequency are useful for segmentation, loyalty analysis, and sales forecasting, but they do not
directly affect crucial business decisions. Gender and preferred category are mainly used for
personalization and targeted marketing; however, they are not reliable for accurately
predicting a customer’s behavior.
Data cleaning is an essential first step in preparing any dataset for analysis. Before meaningful
insights can be extracted, the data must be checked for incorrect data types, missing values, and
inconsistencies that can interfere with statistical procedures. In this section, we carefully inspect
the dataset, correct data types where necessary, and apply appropriate strategies to handle
missing values. This ensures that the dataset is accurate, consistent, and ready for reliable
analysis in later stages.
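The cleaning steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used in the project: the sample rows are hypothetical, and filling numeric gaps with the median and categorical gaps with the mode is an assumed strategy, since the text does not state which strategy was applied.

```python
from statistics import median, mode

# Hypothetical sample rows (NOT the project's data); None marks a missing value.
rows = [
    {"age": "25", "gender": "Male", "income": "52000"},
    {"age": None, "gender": "Female", "income": "61000"},
    {"age": "41", "gender": None, "income": None},
    {"age": "33", "gender": "Female", "income": "48000"},
]

NUMERIC = ["age", "income"]   # columns that should hold floats
CATEGORICAL = ["gender"]      # columns that should hold labels


def count_missing(rows):
    """Count missing (None) entries per column."""
    return {col: sum(r[col] is None for r in rows) for col in rows[0]}


def clean(rows):
    """Return a cleaned copy: numeric strings become floats, gaps are filled."""
    cleaned = [dict(r) for r in rows]
    for col in NUMERIC:
        # Correct the data type first, leaving missing entries untouched.
        for r in cleaned:
            if r[col] is not None:
                r[col] = float(r[col])
        # Assumed strategy: fill numeric gaps with the column median.
        fill = median(r[col] for r in cleaned if r[col] is not None)
        for r in cleaned:
            if r[col] is None:
                r[col] = fill
    for col in CATEGORICAL:
        # Assumed strategy: fill categorical gaps with the most common label.
        fill = mode(r[col] for r in cleaned if r[col] is not None)
        for r in cleaned:
            if r[col] is None:
                r[col] = fill
    return cleaned


missing_before = count_missing(rows)
cleaned_rows = clean(rows)
missing_after = count_missing(cleaned_rows)
```

Comparing `missing_before` with `missing_after` gives a quick check that no gaps remain after cleaning. The median is chosen here because it is less sensitive to outliers than the mean.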
Chapter 3: Exploratory Data Analysis
After data cleaning, we move on to the next stage: Exploratory Data Analysis, an important
step in understanding the underlying structure and characteristics of a dataset. It involves
summarizing the main features using both numerical measures and visualizations, allowing us
to identify patterns, trends, and potential outliers.
b) Spending score:
The distribution is not normal and is fairly evenly spread; however, it is slightly left-skewed,
with a mean of about 50.68.
c) Membership Years:
The histogram and box plot show that the distribution is not normal and is slightly right-
skewed because many users have higher membership duration. The mean is 5.46, and the
values range from 3 to 8.
d) Purchase Frequency:
There is a moderate spread, but most customers fall between 15 and 40 purchases. The
distribution is not normal, as there is a mild left skew, showing that customers return
slightly more frequently.
f) Age:
The values are fairly spread out; however, there is slight right-skewness, as the upper age
values (60-69) stretch the tail. The interquartile range is from 30 to 55, and the median is around 42.
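The skewness, quartile, and median descriptions used for the numeric variables can be checked numerically as well as visually. The sketch below is illustrative only: the `ages` list is hypothetical sample data, not the project's dataset, and `skewness` is a helper name introduced here that computes the population skewness (the mean of the cubed standardized deviations). A positive result indicates a right skew and a negative result a left skew.

```python
from statistics import mean, median, pstdev, quantiles


def skewness(values):
    """Population skewness: mean of cubed z-scores (positive = right-skewed)."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Hypothetical ages (NOT the project's data); one high value stretches the tail.
ages = [22, 25, 28, 30, 31, 35, 40, 42, 45, 69]

# Quartiles: q1 and q3 bound the interquartile range, q2 is the median.
q1, q2, q3 = quantiles(ages, n=4)

summary = {
    "mean": mean(ages),
    "median": median(ages),
    "iqr": (q1, q3),
    "skewness": skewness(ages),
}
```

In this sample the mean sits above the median and the skewness is positive, which is the numeric counterpart of a tail stretched toward the upper values.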
3.1.2- Categorical values:
a) Gender:
The data is relatively evenly distributed among the three groups, as demonstrated by the bar
graph and pie chart, with slightly more individuals falling into the "Other" category.
b) Preferred Category
The bar graph and pie chart show that sports is the most in-demand category; however, the
categories are fairly evenly distributed, with clothing and home & garden having exactly the same count.
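The counts behind a bar graph or pie chart of a categorical variable can be tabulated directly. The following is a minimal sketch using hypothetical category labels, not the project's actual records; it counts each category, converts the counts to percentage shares, and picks out the most frequent category.

```python
from collections import Counter

# Hypothetical preferred-category values (NOT the project's data).
preferred = [
    "Sports", "Clothing", "Sports", "Electronics", "Home & Garden",
    "Clothing", "Sports", "Home & Garden", "Groceries",
]

# Frequency of each category: these are the bar heights in a bar graph.
counts = Counter(preferred)

# Percentage share of each category: these are the slices in a pie chart.
total = sum(counts.values())
shares = {cat: round(100 * n / total, 1) for cat, n in counts.items()}

# The most frequent category and its count.
top_category, top_count = counts.most_common(1)[0]
```

Comparing two entries of `counts` directly is a simple way to confirm whether two categories really have the same frequency.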
3.2- Bivariate Analysis:
This part of the project explores the relationship between two variables at a time, allowing us to
move beyond simple descriptions and start understanding how different aspects of a dataset
interact. It helps reveal whether changes in one variable are associated with changes in another,
whether that relationship is positive, negative, strong, weak, or nonexistent. By using tools such
as heatmaps, regression plots and count plots we will be able to uncover trends, patterns, and
group differences that are not visible through univariate analysis alone.
In our heatmap, almost all the boxes are light blue, showing that most relationships are very
weak. In linear terms, these variables are nearly independent of each other. So, for example,
knowing a customer’s income does not help us predict their spending score.
The highest positive correlation is 0.19, between id and membership years. Since id is an
arbitrary identifier, this value cannot be used for inferential predictions or analysis, and no
meaningful insight can be derived from it.
Some relationships, such as age against last purchase amount and membership years against
spending score, are slightly positive at 0.16 and 0.15 respectively. This shows that, though the
link is very weak, there is a slight tendency for older customers to have a higher last purchase
amount and for longer-standing members to have a higher spending score.
The most prominent negative correlation is -0.14, between income and purchase frequency.
This shows a slight tendency for customers with higher incomes to purchase less frequently.
Overall, the dataset has minimal predictive power, as there are no strong linear correlations.
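Each cell of a correlation heatmap is a Pearson correlation coefficient between two numeric columns. As a rough sketch of how such values arise, the code below computes the coefficient from its definition and builds a small correlation matrix. The column values are hypothetical, and `pearson` and `correlation_matrix` are helper names introduced here for illustration; they are not taken from the project.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Pairwise Pearson correlations, as a nested dict keyed by column name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical numeric columns (NOT the project's data).
columns = {
    "income": [30, 45, 60, 75, 90],
    "spending_score": [52, 48, 55, 47, 51],
    "purchase_frequency": [40, 35, 30, 28, 20],
}

matrix = correlation_matrix(columns)
```

Values near 0 correspond to the pale cells of a heatmap, while values near +1 or -1 indicate strong positive or negative linear relationships. The coefficient only measures linear association, so a value near 0 does not rule out a non-linear relationship.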
3.2.2 Count plots of non-numerical groups against spending score
These plots show the distribution of spending score within different groups. Each bar represents
a distinct spending score value, and the height of the bar shows how many customers share that
spending score. Light colors show low spending scores, while darker colors show higher scores.
1. Gender:
Female: There is a notable spike in the mid-range scores. This suggests female customers are
likely to have spending scores in or near that middle range.
Male: There are several spikes among the higher scores, indicating male customers are likely to
fall within the mid-high to high spending range.
Other: This category shows high variability and no real trend. It also includes corrected data,
so accurate inference is unlikely.
2. Preferred category:
Electronics: The most distinct feature is in the Electronics category. There is a very tall, thin bar
in the lighter color range. This shows that a lot of customers who prefer electronics share the
same, relatively low spending score.
Groceries & Clothing: These categories show a messy distribution with many short bars. This
means customers who prefer these categories have widely varying spending scores, with very
little similarity among them.
Sports: This category has a few notable bars in the low to mid-range, suggesting slight
consistency in the spending habits of such customers.
The count plots show high variability in scores, reinforcing the conclusion from the heatmap
that correlations between all variables are very weak.
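The bar heights in a count plot are simply the number of customers sharing each spending score within each group. A minimal sketch of that tabulation is shown below, using hypothetical (group, score) pairs rather than the project's data; `score_counts_by_group` is a helper name introduced here for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (gender, spending score) pairs (NOT the project's data).
records = [
    ("Female", 50), ("Female", 50), ("Female", 48),
    ("Male", 80), ("Male", 75), ("Male", 80),
    ("Other", 20), ("Other", 90),
]


def score_counts_by_group(records):
    """For each group, count how many customers share each spending score."""
    table = defaultdict(Counter)
    for group, score in records:
        table[group][score] += 1
    return table


table = score_counts_by_group(records)

# The tallest bar in each group: (spending score, number of customers).
most_common_by_group = {g: c.most_common(1)[0] for g, c in table.items()}
```

A group with one dominant count corresponds to a tall spike in the plot, while a group whose counts are all small and similar corresponds to the "many short bars" pattern.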
3.2.3 Regression plots of numerical data against spending score
Overall, the plots depict nearly the same thing. The plots for income, purchase frequency, and
last purchase amount are alike: they all show a nearly horizontal blue line, indicating that
these variables have no noticeable effect on spending score. The dots are scattered randomly,
showing no clear pattern.
The only signs of a pattern appear in age and membership years. Membership years has the
most noticeable upward-sloping line, which shows that long-term members are likely to spend
more. The same applies to age, where there is also a slight upward slope, indicating that older
customers are slightly more likely to spend more.
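The line drawn in a regression plot is an ordinary least-squares fit, and its slope tells us how much the spending score changes, on average, per unit increase in the other variable. The sketch below computes that slope and intercept directly. The data points are hypothetical, and `fit_line` is a helper name introduced here for illustration, not a function from the project.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical values (NOT the project's data).
membership_years = [1, 2, 3, 4, 5, 6, 7, 8]
spending_score = [40, 44, 43, 50, 49, 55, 53, 60]

slope, intercept = fit_line(membership_years, spending_score)

# A slope near 0 gives a nearly horizontal line (no linear effect);
# a positive slope gives an upward-sloping line.
```

A nearly horizontal line corresponds to a slope close to zero, while an upward-sloping line corresponds to a positive slope. The slope alone does not show how tightly the points cluster around the line, so it is best read together with the correlation coefficient.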
Chapter 4: Summary
The project analyzes an e-commerce customer dataset to understand purchasing behavior and
inform business decisions. Data cleaning ensured accuracy, corrected data types, and handled
missing values. Univariate analysis revealed distributions and trends in numeric variables like
income, spending score, and membership years, while categorical variables showed preferences
and demographics. Bivariate analysis explored relationships between variables using heatmaps,
count plots, and regression plots. Overall, most variables showed weak correlations, with slight
positive trends for age and membership years on spending, indicating that older and long-term
customers tend to spend more. Insights can guide marketing, CRM, and sales strategies.