
REPORT ON

CONVERSION TREND LINE


AT
EXPOSYS DATA LABS

Submitted to
COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
In partial fulfillment of the requirements for the award of the
Degree of
BACHELOR OF VOCATION
IN
BUSINESS PROCESS AND DATA ANALYTICS
By
NANMA K V
(Reg No: 25121021)
Under guidance of
Mrs. SUJI JOSE
(Assistant Professor, DDUKK CUSAT)

DDU KAUSHAL KENDRA


YEAR 2024-2025
CERTIFICATE FROM THE ORGANIZATION
CERTIFICATE FROM THE DEPARTMENT

DEEN DAYAL UPADHYAY KAUSHAL KENDRA


COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY

KOCHI- 682022

This is to certify that the report on “Conversion Trendline” is a bonafide study by
NANMA K V, sixth-semester Bachelor of Vocation in Business Process and Data Analytics
student of DDU KAUSHAL Kendra at Cochin University of Science and Technology.

Mrs. Suji Jose                                    Prof. (Dr.) Santhosh Kumar S.
Supervising Guide,                                Director, DDUKK
Assistant Professor, DDUKK

Place: Ernakulam

Date:
DECLARATION

I, NANMA K V, hereby declare that the report entitled “Conversion Trendline” is a bonafide
study done by me in the organization and submitted to DDU KAUSHAL Kendra, CUSAT. It is a
record of work done by me under the guidance of Mrs. Suji Jose, Assistant Professor, DDU
KAUSHAL Kendra, in partial fulfillment of the requirements for the award of the degree of
Bachelor of Vocation in Business Process and Data Analytics. I also declare that this report
is my original work and has not previously formed the basis for the award of any Degree,
Diploma, Associateship, Fellowship, or other similar title of this or any other university or
institution.

NANMA K V
SIGNATURE

Reg No. 25121021

Place: Ernakulam

Date:
ACKNOWLEDGEMENT

I would like to acknowledge my profound sense of gratitude to God Almighty for giving me
the strength, patience, and ability to complete the study and make this report on time.

I am immensely grateful to Mr. Vishnu Vardhan, Project Manager of EXPOSYS DATA LABS, and to
those who supported me at the organization during this study, for their valuable assistance
and continuous support in sharing their knowledge.

My sincere gratitude goes to my guide, Mrs. Suji Jose, Asst. Professor, DDU KAUSHAL
Kendra, CUSAT, who has guided me in every step of this study and is a source of
encouragement throughout this study. Her valuable suggestions and constant support have
allowed me to complete my report.

I would also like to thank the department, especially our Director Prof. (Dr.) Santhosh Kumar
S., and all other faculty for giving me such incredible exposure during my degree.
ABSTRACT
CHAPTER 1
BUSINESS
UNDERSTANDING
1.1 ABOUT THE COMPANY

Exposys Data Labs is a startup operating in Bangalore that aims to solve real-world business
challenges through innovative technologies. Established by a team of the best experts with
diverse backgrounds and skills in research, technology, and business, the company thrives on
creativity, collaboration, and a relentless pursuit of excellence. The core mission of the
company is to address issues, seize opportunities, and prototype solutions using cutting-edge
technologies such as Artificial Intelligence, Machine Learning, Deep Learning, and Data
Science. The company aims to become a reputed organization over the upcoming years.

Driven by a passion for pushing boundaries and embracing emerging technologies, Exposys
Data Labs specializes in addressing challenges faced across different industries, which
include energy, robotics, defense, security, and healthcare.

The solutions provided by the company are tailored to the needs of each specific client,
ensuring maximum impact and value creation. Moreover, the company focuses on building healthy
relationships with its customers, which helps in rectifying errors in, or modifying,
previously delivered solutions. With a culture of innovation and a commitment to continuous
improvement, Exposys Data Labs is poised to be a leader in the rapidly evolving landscape of
technology-driven business solutions. By staying ahead of the curve and embracing change, the
company empowers its clients to thrive in an increasingly digital world, driving efficiency,
profitability, and competitive advantage.

SERVICES

1. Energy
The company is revolutionizing the energy sector with its BRICK energy solution, harnessing
the power of solar and wind energy to provide cost-effective and sustainable alternatives.
With BRICK Energy, clients can significantly reduce their energy bills, paving the way
for a future where electricity bills become a thing of the past.
2. Humanoid
Humanoid robots are capable of handling various domestic tasks, from ordering
groceries to teaching children. With an affordable price tag and a two-year replacement
warranty, our humanoid eliminates the need for additional helpers, streamlining
household management.
3. Defense
The drone surveillance device offers state-of-the-art monitoring capabilities for streets and
borders, equipped with solar technology for extended operations. With its shooting
capabilities, it ensures enhanced security measures, potentially reducing the reliance on
traditional military forces.
4. Blockchain Technology
Harness the power of blockchain technology for efficient record-keeping and database
management across various industries. Blockchain solutions offer secure and transparent
transactional systems, paving the way for streamlined operations and enhanced trust.
5. Medical
A liquid GPS medical chip monitors human and animal health, controlled by a
centralized e-diagnosis center. With daily notifications about one's health status, our
solution aims to revolutionize healthcare, potentially reducing the dependency on
traditional medical practitioners.
6. E-Security
BPSS (Behaviour Pattern Security Socket) creates personalized online DNA IDs for
internet users, with online behavior serving as the password. By eliminating the need to
remember passwords, our solution enhances online security and user experience.
7. AI and ML Projects
The company specializes in AI and ML projects that empower businesses with advanced
capabilities such as spam filters, product recommenders, and fraud detectors. By leveraging
these technologies, clients can make data-driven decisions and stay ahead in today's
competitive landscape.

1.2 PROJECT BACKGROUND

The project, “Conversion rate trendline analysis”, supports the Marketing department of an
organization. For this project, I set out to develop a model that predicts the future
conversion rate of an e-commerce website. The conversion rate is the key output of an
e-commerce website, so the objective of the marketing department is to increase it.
1.3 INTRODUCTION

Predicting the future conversion rate is an important requirement for an e-commerce
website. The conversion rate is a key metric used to analyse the growth of an e-commerce
platform. The primary objective of this project is to develop a predictive model capable of
forecasting the future conversion rate for an e-commerce website. The project aims to equip
the Marketing department of an organization with proper planning and strategies to reduce
waste in its operations.

Waste in the marketing department can impact the conversion rate of an e-commerce website.
Understanding the conversion rate serves as a critical indicator of the website’s performance
and the efficiency of the marketing efforts. The objective is to increase conversion rates and
thereby drive business growth and profitability. Analyzing the future conversion rate by using
historical data can be very useful for reducing wastage and increasing the conversion rate.
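To make the metric concrete, the calculation behind a conversion rate can be sketched in a few lines. The figures below are purely illustrative, not from the report's dataset:

```python
# Conversion rate: the fraction of visitors who become customers.
def conversion_rate(conversions, visitors):
    """Return conversions / visitors, guarding against zero traffic."""
    return conversions / visitors if visitors else 0.0

# Illustrative figures: 30 purchases out of 1,500 visits -> 2% conversion.
rate = conversion_rate(conversions=30, visitors=1500)
print(rate)
```

A trendline analysis then tracks this single number per day over time, which is exactly the aggregation the later chapters describe.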

The waste in the marketing department stems from the reactive marketing strategies followed by
the organization. For each campaign and promotion, the department invests time, money, and
human resources, yet most of these marketing efforts do not achieve the expected targets in
the form of conversion rate. This project helps reduce that waste and supports a more
proactive marketing method, enabling proper planning and the implementation of new strategies.

To address this challenge, I am creating a model to predict the future conversion rate, which
helps in building an adaptive marketing, planning, and deep understanding of customer
behavior. I have also integrated an alert system by using association rules. The project covers
different steps in a data analytics project, which includes Data Understanding, Data
Preprocessing, Data visualization, Model building, and Model Evaluation.

1.4 OBJECTIVE

The project aims to provide the marketing department with a powerful tool for strategic
planning and decision-making. By predicting future conversion rates, the marketing
department can better allocate resources and campaigns, thereby reducing wastage and
enabling cost savings, proactive planning, and better strategies.

 Analyse the trend of future conversion rates.
 Develop an alert system integrated with the prediction system.
 The system should suggest promotions, campaigns, and suitable promotion and campaign
strategies derived from historical user data.

1.5 EXISTING PROCESS (AS-IS PROCESS)

The marketing department in the organization follows a reactive marketing culture. This
strategy is characterized by a short-term focus, limited strategic planning, ad hoc
decision-making, and a lack of emphasis on data-driven insights and continuous optimization:
the campaigns launched by the marketing department respond to an immediate need or
circumstance. In other words, the department does not plan proactively; it makes decisions
only upon the occurrence of an event, whether that is a dip or a rise in conversion rates.

This type of planning often requires sudden decision-making, which prevents a new strategy
from being used over the long term, because decisions are made without a thorough analysis or
consideration of long-term implications, leading to inconsistency and inefficiency in
marketing efforts. Reactive marketing may not fully leverage available data analytics to
inform decision-making. Instead of using data to anticipate trends and plan, marketing
decisions are often based on intuition or immediate observations. Not only does this affect
the long-term adaptability of a campaign, but there may also be less emphasis on targeted
audience segmentation and personalized messaging. Campaigns become broad and generic,
potentially missing opportunities to engage specific customer segments effectively. Due to
the focus on short-term objectives, there may be limited emphasis on continuously optimizing
strategies and tactics. Without a proactive approach, the marketing department fails to adapt
to changing market conditions or customer preferences, and the investments made in each
marketing effort become a waste of time, resources, and money. It is therefore necessary to
reduce waste in the marketing department and make planning and strategies more adaptable and
long-term.
CHAPTER 2
DATA UNDERSTANDING
2.1 DATA COLLECTION

The data used for the project is from an e-commerce website. The dataset contains individual
details about the users, and this data was scraped from the website itself and stored as a
CSV file. The data contains not only user details but also the specific interaction metrics
of each individual on the website. The dataset does not provide any data about the e-commerce
platform or organization; it only provides insights into user transactions and interactions,
which makes the data user-specific rather than website-oriented. Consequently, this data can
only be used for this specific project, i.e. for the analysis of this particular e-commerce
website, and not for projects concerned with website-level analysis, as its user orientation
restricts the generalisability of any website-level conclusions. As the data collected is
multivariate time series data spanning two years, the dataset is large, and it illustrates
the variety and volume of data that can be gained from a single website.

2.2 DATA UNDERSTANDING

The data collected from scraping may contain outliers, inconsistencies, and null values.
Before building any model, it is necessary to clean the data to ensure the integrity of the
model; this step helps increase both the accuracy and the generalisability of the model.
Before starting that process, the first stage of the project is to understand the data: its
volume, data types, and distribution. The dataset has 50,000 rows and 12 features, and the
target variable of the initial dataset is the conversion status of a visitor. This dataset is
further aggregated to develop a new dataset for conversion rate analysis. The previous target
variable was categorical and is transformed into a continuous value. The conversion rate is
found by aggregating the details of the users for a single day; the conversion rate for each
corresponding day is computed and stored in a new data frame. The other features of the new
dataset are the male-to-female ratio, the most visited age category, the average scroll
depth, and the average session duration. The only categorical value in the new dataset is the
age category, which comprises four age groups spanning ages 15 to 45. The male-to-female
ratio depicts the ratio between male and female visitors; the average scroll depth is the
mean scroll depth over the n users who visited the website that day, and the same applies to
the average session duration. The conversion rate ranges from 0 to 1: 0 means that none of
the n visitors were converted to customers, and 1 means that all n visitors were converted.
The new dataset for conversion trendline analysis has 665 rows and 5 columns. It consists of
conversion rate data from 2022 to 2023, i.e. two years of data, which is enough for a
trendline analysis.

Fig 2.2 Dataset Shape and Data Types.

2.3 DESCRIPTIVE STATISTICS

Descriptive statistics serve as the basic step to understanding the data. They provide
summaries of the features, which might include the mean, median, minimum, and standard
deviation. These measures help in understanding the center of the data and its dispersion.
These statistics help to grasp the characteristics of the data without making assumptions
about the population or analyzing relationships between the features.

2.3.1 MEASURING CENTRAL TENDENCY

Measuring central tendency is a way to describe the center or typical value of a set of data.
There are three main measures of central tendency: mean, median, and mode. Each of these
measures gives a different perspective on the center of the data, and they are used in various
situations depending on the nature of the data and the context of the analysis.

Mean (Average)

 The mean is calculated by adding up all the values in a data set and then dividing by
the number of values.
 Mean = Sum of all values / Number of values
 The mean can be sensitive to outliers or extreme values in the data, which can skew
the result.

Median

 The median is the middle value of a data set when the values are arranged in
ascending or descending order.
 If there is an odd number of observations, the median is the middle number. If there is
an even number of observations, the median is the average of the two middle
numbers.

Mode

 The mode is the value that appears most frequently in a data set.
 A data set may have one mode (unimodal), two modes (bimodal), or more than two
modes (multimodal), or it may have no mode if all values occur with the same
frequency.
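The three measures can be computed directly with pandas. This is a minimal sketch on a hypothetical session-duration column (values in seconds, not taken from the report's dataset):

```python
import pandas as pd

# Hypothetical session durations in seconds; 480 is a deliberate outlier.
s = pd.Series([120, 150, 150, 200, 480])

mean = s.mean()      # pulled upward by the outlier 480
median = s.median()  # robust middle value
mode = s.mode()[0]   # most frequent value

print(mean, median, mode)
```

Note how the outlier drags the mean well above the median, which illustrates the sensitivity mentioned in the bullet above.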

2.3.2 MEASURING DISPERSION

Measuring dispersion is crucial for understanding the variability of the data points in each
feature. It can indicate the extent to which data points are scattered around the central values.
This also highlights the presence of any outliers within the data. Outliers can skew the
analysis and affect the reliability of statistical conclusions.

Range

 The simplest measure of dispersion.
 The difference between the maximum and minimum values of the data.
 It does not provide a robust indication of variability.

Variance

 The average of the squared differences between each data point and the mean.
 Provides a measure of how far each data point is from the mean value.

Standard Deviation

 The square root of the variance.
 It is expressed in the original units of the data.
 It summarizes dispersion around the mean.
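These dispersion measures can be sketched with pandas on the same kind of toy series (illustrative values, not the report's data). Note that pandas computes the sample variance (`ddof=1`) by default:

```python
import pandas as pd

s = pd.Series([120, 150, 150, 200, 480])

value_range = s.max() - s.min()  # simplest measure of spread
variance = s.var()               # sample variance (ddof=1 by default)
std = s.std()                    # square root of the variance, same units as the data

print(value_range, variance, std)
```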

Fig 2.3 Descriptive Statistics


CHAPTER 3
DATA PREPARATION
3.2 DATA CLEANING

The first step introduced before creating the new dataset is data preprocessing on the original
dataset. Data cleaning, also known as data cleansing or data preprocessing, is a crucial step in
the data analysis process. It involves identifying and correcting errors, inconsistencies, and
missing values in the dataset to ensure its accuracy and reliability for analysis. Data cleaning
encompasses several tasks, including handling missing values by either imputing or removing
them, removing duplicate records to ensure each observation is unique, and correcting
inconsistent data such as spelling errors or formatting inconsistencies.

Additionally, data cleaning involves validating the dataset to ensure it meets analysis
requirements and is free from errors or inconsistencies. The ultimate goal of data cleaning is
to prepare the dataset in a standardized and consistent format, making it easier to analyze,
interpret, and derive meaningful insights.

The cleaning process consists of different steps that ensure the data can be used for model
building; it increases the accuracy and reliability of the data, and each cleaning step plays
its own part in ensuring data quality. These steps ultimately increase the accuracy of the
model as well, because poor-quality data used for modeling can undermine the model's
accuracy.

3.2.1 HANDLING MISSING VALUES

This step is essential in data preprocessing: it ensures the accuracy, reliability, and
integrity of the data, thereby enhancing the quality of analyses, improving model
performance, and facilitating more informed decision-making. It identifies and manages null
values within a dataset, explicitly represented as NaN or Null. Proper handling of missing
values not only finds the null values but also manages them appropriately to maintain the
reliability and accuracy of the dataset. The function isnull().sum() can be used to find the
total number of null values per column in the dataset.
Fig 3.2.1.1 Identifying the null values

After identifying the null values in a dataset, the next step is to remove them by imputation,
deletion, or similar methods. Imputation involves filling missing values with statistical
measures such as the mean, median, or mode; which measure to use depends on the type of the
feature. For numeric features, the mean or median is used, and for categorical (object)
features, the mode is used.
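The detection and imputation steps described above can be sketched as follows. The toy frame and its column names (`session_duration`, `gender`) are illustrative, not the report's actual schema:

```python
import pandas as pd

# Toy frame with deliberate gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "session_duration": [30.0, None, 90.0, 60.0],
    "gender": ["M", "F", None, "F"],
})

print(df.isnull().sum())  # null count per column, as described above

# Numeric feature: fill with the median; categorical feature: fill with the mode.
df["session_duration"] = df["session_duration"].fillna(df["session_duration"].median())
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])
```

After imputation, `df.isnull().sum()` reports zero nulls in every column.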

Fig 3.2.1.2 Handling missing values

In certain circumstances, when the dataset is large and is not time series data, one can drop
the entire row containing a null value; however, this is not appropriate for most projects,
as it affects the generalisability of the model.

On the other hand, when there are no null values in the dataset, we can skip imputation and
focus on data validation to verify accuracy, consistency, and completeness.
3.2.2 HANDLING REDUNDANCY

The aim of handling redundancy is to maximize storage economy and minimize storage expense.
Redundant data occupies unnecessary storage space, especially in large datasets or databases.
By identifying and removing redundancy, organizations can significantly reduce storage costs
and improve database performance by reducing the volume of data to be processed. Removing
redundancy also ensures accuracy in data analysis by reducing the bias and errors that may
arise from duplicate or inconsistent data. Redundancy can be removed by merging similar
records or deleting duplicate rows. It also helps maintain compliance with regulations and
standards by ensuring data accuracy, consistency, and integrity.
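Removing exact duplicate rows is a one-liner in pandas. A minimal sketch with hypothetical columns:

```python
import pandas as pd

# Toy frame where the row (2, 8) appears twice.
df = pd.DataFrame({"user_id": [1, 2, 2, 3], "visits": [5, 8, 8, 2]})

# Exact duplicates inflate storage and bias frequency-based statistics.
deduped = df.drop_duplicates().reset_index(drop=True)
```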

3.2.3 HANDLING OUTLIERS

Handling outliers is crucial for maintaining data quality, ensuring accurate analysis, and
improving the performance of statistical models. Outlier detection is an important step
because the dataset contains numerous inconsistencies and errors; detecting and removing
these anomalies turns it into a healthy dataset. Outliers are detected using a boxplot, a
graphical representation that displays the distribution of a dataset along with its median,
quartiles, and potential outliers.

In some projects, outliers have their own importance, as they can reveal the limitations or
new scope of a project. Outlier detection matters because values that deviate greatly from
the mean can dominate the model's predictions, so one needs to identify the outliers and
remove them using an appropriate method.
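One common removal method, consistent with the boxplot view, is Tukey's 1.5*IQR rule. A sketch on toy values (the 95 is a planted outlier, not real data):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

# Tukey's 1.5*IQR rule -- the same fences a box plot draws as whiskers.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)
cleaned = s[mask]
```

Rows outside the fences are dropped; everything within the whisker range survives.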

3.3 EXPLORATORY DATA ANALYSIS

Exploratory Data Analysis (EDA) is a crucial initial step in the data analysis process, aimed
at understanding the main characteristics, patterns, and relationships within a dataset before
applying complex statistical models or machine learning algorithms. It helps to inform data
cleaning, feature engineering, and model selection decisions by providing a comprehensive
understanding of the dataset's characteristics and informing the appropriate strategies for
handling and analyzing the data effectively.

3.3.1 UNIVARIATE ANALYSIS

Univariate analysis is a statistical method used to examine and describe the distribution,
central tendency, and variability of a single variable in a dataset without considering
relationships with other variables. It provides a basic understanding of the characteristics of
individual variables and is often the first step in exploratory data analysis. In univariate
analysis, various statistical measures and graphical techniques are employed to summarize
and visualize the data. It is an essential component of data analysis in various fields including
statistics, economics, social sciences, and data sciences. It helps to identify potential issues,
outliers, or interesting features in the data that may require further investigation or
preprocessing before advanced analyses or modeling can be performed.

3.3.1.1 BOX PLOT ANALYSIS

Box plot analysis, also known as box-and-whisker plot analysis, is a graphical representation used to
visualize the distribution, central tendency, and variability of numerical data. The box plot
provides a concise summary of the data's five-number summary: minimum, first quartile
(Q1), median (Q2 or 50th percentile), third quartile (Q3), and maximum.

Components of a box plot:

 Median(Q2):
The middle value of the dataset when it's sorted in ascending order. It divides the data
into two halves, with 50% of the data points lying below and 50% above the median.
 Interquartile Range (IQR):
The range between the first quartile (Q1) and the third quartile (Q3), representing the
middle 50% of the data. It provides a measure of the spread or variability of the data.
 Whiskers:
Lines extending from the box to the minimum and maximum data values, excluding
outliers. Whiskers can be calculated in different ways, but commonly used methods
are the 1.5*IQR rule or Tukey's fences.
 Outliers:
Data points that fall outside the whiskers and are considered to be unusually high or
low compared to the rest of the data.

Box plots of the numerical features in the dataset are visualized; this helps in identifying
the outliers in the data, along with their ranges and mean values. The feature with the most
outliers is session duration, and it is important to know the range of these outliers, as
outliers in session duration and scroll depth are significant features when analysing a
website.
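Drawing such a box plot and reading off its five-number summary can be sketched as below. The session durations are hypothetical, and the plot is rendered off-screen; in practice one would save to a file or display it:

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical session durations in seconds; 300 plays the outlier role.
durations = pd.Series([40, 55, 60, 62, 70, 75, 300])

fig, ax = plt.subplots()
ax.boxplot(durations, whis=1.5)  # whiskers drawn at the 1.5*IQR fences
ax.set_ylabel("session duration (s)")
fig.savefig(io.BytesIO(), format="png")  # would be a filename in practice

# The five-number summary the box plot encodes:
summary = durations.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
```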


3.3.1.2 DISTRIBUTION PLOT

Distribution plots provide a visual summary of the distribution of data, showing the
frequency or density of values. Understanding the distribution of data is crucial for various
data analysis tasks such as identifying patterns, detecting outliers, and making predictions. By
looking at the shape of the distribution, you can infer information about the central tendency
(mean, median, mode) and the spread (variance, standard deviation) of the data. For example,
a symmetric distribution with a single peak suggests that the data is concentrated around a
central value, while a skewed distribution indicates asymmetry.
For many statistical analyses, it's important to know whether the data follows a normal
distribution. Distribution plots can help assess normality by visually comparing the shape of
the distribution to a normal distribution curve (bell-shaped curve).

The same three features are plotted as distributions. Age and scroll depth approximately
follow a normal distribution with minimal left or right skewness, but session duration does
not follow a bell curve; its data is mostly left-skewed.
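The skewness statistic quantifies what the distribution plot shows visually. A small sketch with made-up values (not the dataset's):

```python
import pandas as pd

# Skewness sign tells the tail direction a distribution plot would show:
# near 0 -> roughly symmetric; negative -> left-skewed; positive -> right-skewed.
symmetric = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5])
left_skewed = pd.Series([1, 7, 8, 9, 9, 10, 10, 10])

print(symmetric.skew())    # approximately 0
print(left_skewed.skew())  # negative: long tail on the left
```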


3.3.1.3 BAR PLOT

A bar plot is a common visualization tool used in univariate analysis to display the
distribution of categorical variables. It represents the frequency or proportion of each
category within the variable by using rectangular bars. Each bar's height corresponds to the
frequency or proportion of the category it represents.
From the visualization, it is visible that the number of male visitors is almost twice that
of female visitors.

Bar plots are also drawn for the features wish list 1, wish list 2, most searched category 1,
and most searched category 2. This helps in identifying the best-performing products on the
website and determining which products should be focused on.
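A categorical bar plot of this kind reduces to `value_counts()` plus a bar chart. The gender values below are illustrative (chosen to mirror the roughly two-to-one ratio observed), and the figure is rendered off-screen:

```python
import io

import matplotlib
matplotlib.use("Agg")  # off-screen rendering
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative gender column mirroring the ~2:1 male-to-female ratio.
gender = pd.Series(["M", "M", "F", "M", "F", "M"])

counts = gender.value_counts()  # frequency of each category
ax = counts.plot(kind="bar")    # one bar per category, height = frequency
ax.set_xlabel("gender")
ax.set_ylabel("number of visitors")
ax.figure.savefig(io.BytesIO(), format="png")  # would be a filename in practice
```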


3.3.2 BIVARIATE ANALYSIS

Bivariate analysis is a statistical method used to analyze the relationship between two
variables. Unlike univariate analysis, which focuses on analyzing a single variable, bivariate
analysis examines how the values of two variables are related to each other. It seeks to
understand whether there is a relationship, association, or correlation between the two
variables.

In bivariate analysis, the focus is on understanding the nature and strength of the relationship
between the two variables. For continuous variables, a scatter plot can provide insights into
the direction (positive or negative) and strength of the relationship. Correlation coefficients
quantify the strength of this relationship, ranging from -1 to 1.

3.3.2.1 SCATTER PLOT

A scatter plot is a graphical representation of bivariate data, meaning it displays the
relationship between two variables. Each point on the plot represents a single observation,
with one variable plotted on the x-axis and the other on the y-axis. Scatter plots are widely
used in bivariate analysis because they provide a visual way to explore the relationship
between two continuous variables.

Scatter plots allow you to visualize the relationship between two variables. By examining the
pattern formed by the points, you can gain insights into how the variables are related.

If there is a clear pattern or trend in the points, it suggests that the variables are related in
some way. For example, points may form a straight line, indicating a linear relationship, or
they may follow a curved pattern, suggesting a nonlinear relationship.


Here the scroll-depth distribution is binned into three age categories; the age group of 25
to 35 shows the largest scroll-depth values. In summary, scatter plots are valuable tools in
bivariate analysis for visually exploring the relationship between two continuous variables.
They provide insights into the direction, strength, and patterns of association between
variables, helping researchers better understand the underlying relationships in their data.
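An age-versus-scroll-depth scatter plot like the one discussed can be sketched as follows. The data points are hypothetical stand-ins for the dataset's real values, and the plot is rendered off-screen:

```python
import io

import matplotlib
matplotlib.use("Agg")  # off-screen rendering
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical age / scroll-depth pairs; real values come from the dataset.
df = pd.DataFrame({
    "age": [18, 22, 27, 30, 34, 41],
    "scroll_depth": [0.35, 0.42, 0.78, 0.81, 0.74, 0.40],
})

fig, ax = plt.subplots()
ax.scatter(df["age"], df["scroll_depth"])  # one point per observation
ax.set_xlabel("age")
ax.set_ylabel("average scroll depth")
fig.savefig(io.BytesIO(), format="png")  # would be a filename in practice
```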

3.3.2.2 BAR PLOT


A barplot is a graphical representation commonly used in bivariate analysis to explore the
relationship between two categorical variables. In this analysis, each variable corresponds to
one axis of the plot. The bars represent the frequency or proportion of observations falling
into each category of the variables. By examining the bar plot, one can discern patterns,
differences, and associations between the variables.

When creating a bar plot for bivariate analysis, one common approach is to generate separate
plots for each categorical variable to visualize their distributions. This allows for an initial
understanding of the composition and frequency of each category within the dataset.
Additionally, bar plots can be structured to display the conditional distribution of one variable
within each category of the other variable. This method enables a comparison of how the
distribution of one variable varies across the levels of the other variable.

Furthermore, grouped or stacked bar plots can be utilized to directly compare the frequencies
or proportions of one variable across different categories of the other variable. These plots
provide insights into any disparities or consistencies in the distribution of one variable based
on the levels of the other variable.

By examining the bar plot, analysts can identify associations, trends, and patterns between
the categorical variables, facilitating a deeper understanding of their interrelationship
within the dataset. In this figure, the male category purchased more clothing items, while
the female category purchased more personal care and beauty products. Overall, bar plots
serve as effective visual tools for bivariate analysis, aiding in the exploration and
interpretation of categorical data.
3.3.2.3 CORRELATION MATRIX

It's a square matrix where each cell represents the correlation coefficient between two
variables. The correlation coefficient, typically denoted by "r," ranges from -1 to 1, indicating
the strength and direction of the relationship. A value of 1 signifies a perfect positive
correlation, meaning when one variable increases, the other also increases proportionally.
Conversely, a value of -1 represents a perfect negative correlation, indicating that when one
variable increases, the other decreases proportionally. A correlation of 0 suggests no linear
relationship between the variables. This aids in understanding how changes in one variable
may influence another, aiding in decision-making processes. Moreover, it helps to detect
multicollinearity, where two or more variables are highly correlated, which can impact the
reliability of statistical models.
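Computing the matrix is a single pandas call. A toy frame makes the extreme cases visible: `y` is an exact multiple of `x` (perfect positive correlation), while `z` moves independently:

```python
import pandas as pd

# Toy numeric frame; column names are illustrative.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # y = 2x -> perfect positive correlation with x
    "z": [5, 3, 4, 1, 2],    # unrelated values
})

corr = df.corr()  # Pearson r for every pair of columns
print(corr.round(2))
```

A pair of features with |r| close to 1, like `x` and `y` here, signals the multicollinearity the paragraph warns about.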

3.4 DATA ENCODING

Data encoding refers to converting data from one form to another, typically to facilitate
storage, transmission, or processing. In the context of data analysis and machine learning,
encoding is often used to transform categorical or textual data into a numerical format that
can be processed more easily by algorithms and models. From the data types reported by
data.dtypes, we can identify the columns that need encoding into a numerical format suitable
for analysis and modeling. Data encoding should be ensured in any machine learning project:
since models can only operate on numbers, categorical text values have to be converted. A
popular encoding method is the LabelEncoder() class from the scikit-learn library. This
method is useful when the data contains a large number of unique categorical values within a
feature, but it does not make obvious which integer was assigned to each unique categorical
value.

This can be avoided by manually assigning an integer to each unique categorical value. The
map() function can be used for this, but first the unique values in the feature must be
identified with the unique() function. After identification, a distinct integer is assigned to
each categorical value.

Fig 3.4 Data Encoding using Label Encoder
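A minimal sketch of the manual approach, assuming a hypothetical “Gender” column rather than the report's exact feature names:

```python
import pandas as pd

# Illustrative frame; column name and values are assumptions for the sketch.
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Inspect the unique categories first, then map each to an explicit integer,
# so the assignment is known and reproducible (unlike LabelEncoder's).
print(df["Gender"].unique())
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})
print(df["Gender"].tolist())  # [0, 1, 1, 0]
```

Keeping the mapping dictionary in the code documents exactly which integer stands for which category.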

3.5 FEATURE GENERATION

The initial target variable was a categorical value, which is transformed into a continuous
value. The conversion rate is found by aggregating the details of all users for a single day,
and the conversion rate for each day is stored in a new data frame. The other features of the
new dataset are the male to female ratio, the most visited age category, the average scroll
depth, and the average session duration. The only categorical feature in the new dataset is the
age category, which includes four age groups; the distribution of age is from 15 to 45. The
male to female ratio depicts the ratio between male and female visitors, and the average scroll
depth is the mean scroll depth over the n users who visited the website that day; the same
applies to the average session duration. The conversion rate ranges from 0 to 1: 0 means that,
out of n visitors, no one was converted to a customer, and 1 means that all n visitors were
converted. The new dataset for conversion trendline analysis has 665 rows and 5 columns. It
covers conversion rate data from 2022 to 2023, about two years, which is enough for a
trendline analysis.

The first feature of the new dataset is the date. Although the initial dataset has a date feature,
users visiting on the same day produce redundant values in it, so we only need to collect each
date once, from the starting date to the ending date, without redundancy. For that, the
unique() function is applied to the “date” column of the initial dataset. This gathers the
unique dates, which are stored in a new data frame. While doing this, the column is converted
to the datetime data type to avoid aggregation errors.

Fig 3.5.1 Generation of Column date

Here we can see that the output is dates from 01-03-2022 to 17-11-2023, as expected. There
are 627 rows, which means time series data for 627 days, approximately two years.
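The date extraction described above can be sketched as follows, with a toy log standing in for the initial dataset:

```python
import pandas as pd

# Toy visitor log; the "date" column name follows the report, values are assumed.
data = pd.DataFrame({
    "date": ["01-03-2022", "01-03-2022", "02-03-2022", "03-03-2022"],
})

# Convert to datetime first so later groupby/aggregation steps do not fail.
data["date"] = pd.to_datetime(data["date"], format="%d-%m-%Y")

# unique() drops the per-visitor redundancy, leaving one entry per day.
dates = pd.DataFrame({"date": data["date"].unique()})
print(len(dates))  # 3 distinct days
```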

The next feature to be generated is the male to female ratio; it is simply the ratio of the
number of male visitors to female visitors. This can help in demographic analysis, as it
creates insight into the relation between gender and conversion rates. To create this feature,
we first find the total number of male and female visitors in a day, and then compute the
ratio.

Male to female ratio = number of male visitors in a day / number of female visitors in a day

To find the number of visitors on each day by gender, the data is first grouped by day and
gender, and the count for each gender is stored in a new data frame; a new data frame is then
constructed by applying the above equation for each day. Because of the redundancy in the
data, the resulting frame would otherwise have the size of the initial dataset, so to avoid this
only the unique value for each day is stored in the new data frame “male_to_female”.

Fig 3.5.2 Generation of column male_to_female
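A sketch of this computation, using assumed column names “date” and “gender”:

```python
import pandas as pd

# Hypothetical visitor log; column names are assumptions for the sketch.
data = pd.DataFrame({
    "date":   ["2022-03-01"] * 3 + ["2022-03-02"] * 3,
    "gender": ["Male", "Male", "Female", "Female", "Female", "Male"],
})

# Count visitors per day and gender, then divide the two columns.
counts = data.groupby(["date", "gender"]).size().unstack(fill_value=0)
male_to_female = (counts["Male"] / counts["Female"]).rename("male_to_female")
print(male_to_female.tolist())  # [2.0, 0.5]
```

unstack() yields one row per day with a column per gender, so the ratio is a simple column division.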

While considering the gender ratio as a demographic feature, it is also essential to consider
age, but the initial dataset has age as a continuous value. Even taking the most visited age in
a day would not be appropriate, so the continuous values are converted into bins. Bins were
declared for creating the age groups, so the numeric continuous value is converted into a
categorical value. The created bins are 0-18, 19-25, 26-35, 36-45, 46-55, 56-65, and 65+;
since the ages in the data range from 15 to 45, four of these groups actually occur.
Each age value is matched with its bin and the corresponding age group is marked for each
visitor; the age groups are then grouped by date, and the mode of the age group is taken,
which gives the most visited age group of a specific day. This new value is stored in a new
data frame “age_group”.

Fig 3.5.3 Generation of column age_group
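The binning step can be sketched with pandas.cut; the bin edges follow the groups listed above, while the ages and dates themselves are hypothetical:

```python
import pandas as pd

# Hypothetical ages; bin edges and labels follow the report's age groups.
data = pd.DataFrame({
    "date": ["2022-03-01", "2022-03-01", "2022-03-01", "2022-03-02"],
    "age":  [17, 24, 24, 30],
})

bins   = [0, 18, 25, 35, 45, 55, 65, 120]
labels = ["0-18", "19-25", "26-35", "36-45", "46-55", "56-65", "65+"]
data["age_group"] = pd.cut(data["age"], bins=bins, labels=labels)

# Mode per day = the most visited age group on that day.
age_group = data.groupby("date")["age_group"].agg(lambda s: s.mode()[0])
print(age_group.tolist())  # ['19-25', '26-35']
```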

After considering the demographic features, the most important features are those related to
the website, such as session duration and scroll depth; both of these features in the initial
dataset have a user-specific value for each day. So, scroll depth is converted to the average
scroll depth in a day: the data is grouped by date, and the mean scroll depth for each day is
found and stored in a new data frame “scroll depth”.
Fig 3.5.4 Generation of column scroll depth

Using the same logic, the average session duration for a day is found and stored in the data
frame “Session Duration”.

Fig 3.5.5 Generation of column Session Duration
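Both daily averages can be sketched with a single groupby-mean, on toy columns standing in for the real log:

```python
import pandas as pd

# Hypothetical per-visitor log; groupby-mean collapses it to one row per day.
data = pd.DataFrame({
    "date":             ["2022-03-01", "2022-03-01", "2022-03-02"],
    "scroll_depth":     [40.0, 60.0, 80.0],
    "session_duration": [100.0, 200.0, 300.0],
})

daily = data.groupby("date")[["scroll_depth", "session_duration"]].mean()
print(daily["scroll_depth"].tolist())      # [50.0, 80.0]
print(daily["session_duration"].tolist())  # [150.0, 300.0]
```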

The project is concentrated on the conversion rate, and it is visible that the initial dataset has
no column related to it. However, the purchase status of each visitor is given, so the
conversion rate is found by dividing the total customers by the total visitors. For this we have
to find the total number of visitors for a specific day and the number of visitors whose
purchase status is “Bought”. The conversion rate ranges from 0 to 1: 0 indicates that there
was no conversion, and 1 indicates that all visitors were converted into customers.

While generating the feature, the number of visitors and customers per day is initially taken
into account, but these counts are not kept as features in the project. A fault in the feature
generation can be detected when the data frame contains any value greater than 1 or less than
0, since 0 to 1 is the expected range for this column.

Conversion rate= Number of visitors converted to customer/ Total number of visitors

The data is grouped by date, and the numbers of visitors and customers are found and stored
in a new column; however, this leaves redundant conversion rate values for users who visited
on the same day. To avoid this, the unique value for each day is selected (equivalently, the
duplicated daily conversion rates can be dropped). Thus, the conversion rate is found and
moved to the data frame “conversion rate”.

Fig 3.5.6 Generation of column Conversion Rate
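A sketch of the daily conversion rate computation, assuming a hypothetical “status” column with the value “Bought” for converted visitors:

```python
import pandas as pd

# Hypothetical log with a purchase status per visitor.
data = pd.DataFrame({
    "date":   ["2022-03-01"] * 4 + ["2022-03-02"] * 2,
    "status": ["Bought", "Left", "Bought", "Left", "Left", "Left"],
})

# Conversion rate = customers / visitors, computed per day with named aggregation.
daily = data.groupby("date")["status"].agg(
    visitors="size",
    customers=lambda s: (s == "Bought").sum(),
)
daily["conversion_rate"] = daily["customers"] / daily["visitors"]
print(daily["conversion_rate"].tolist())  # [0.5, 0.0]
```

Aggregating directly per day avoids the redundancy problem entirely: the result has one row per date by construction.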

3.6 DATA AGGREGATION

Data aggregation is an essential step in data processing and analysis that involves gathering,
summarizing, and consolidating data from multiple sources or datasets to produce a
meaningful and consolidated view of the data. It helps in reducing complexity, improving
data quality, enabling analysis and interpretation, supporting decision-making, and enhancing
reporting and visualization, contributing to more effective and informed data-driven decision-
making and business strategies.

Initially, the dataset had 50000 data points containing the user data of the visitors for each
day. From that, the new dataset was created by aggregating each feature: the conversion rate
for each date was found and stored in a new data frame, and the other features, the male to
female ratio, average scroll depth, average session duration, and most visited age category for
each day, were aggregated with the data frame containing the conversion rate. This new data
frame is used for the conversion rate trendline analysis.

The features for this dataset were created through feature engineering on the initial dataset.
This reduced the size of the initial dataset to 626 data points, worth about two years of data,
which is enough for a regular trendline analysis.

Fig 3.6 Aggregation of generated features

The above features are aggregated using the pd.concat() function, which joins data frames; to
make this easier, the features generated earlier were stored in separate data frames. The
resulting data frame (the new dataset) has six columns, or six features, including the
conversion rate and the date.
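A minimal sketch of this join, assuming the per-day feature frames share the date as their index (frame and column names here are illustrative):

```python
import pandas as pd

# Hypothetical per-day feature frames sharing the same date index.
idx = pd.to_datetime(["2022-03-01", "2022-03-02"])
ratio  = pd.DataFrame({"male_to_female": [2.0, 0.5]}, index=idx)
scroll = pd.DataFrame({"avg_scroll_depth": [50.0, 80.0]}, index=idx)
rate   = pd.DataFrame({"conversion_rate": [0.5, 0.0]}, index=idx)

# axis=1 joins column-wise, aligning the frames on their shared date index.
dataset = pd.concat([ratio, scroll, rate], axis=1)
print(dataset.shape)  # (2, 3)
```

Concatenating on a shared index, rather than by position, keeps each day's features aligned even if a frame has missing days.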

3.7 FEATURE SCALING

Feature scaling is a crucial preprocessing step in machine learning that standardizes the range
of independent variables or features in the data. It's essential because many machine learning
algorithms are sensitive to the scale of the input features. Features with larger scales can
disproportionately influence the model, potentially leading to poor performance or longer
training times. Once the dataset is ready, a scaling method is chosen based on the data’s
characteristics; common methods include standard scaling, min-max scaling, robust scaling,
and normalization. After selecting a method, the scaler is fitted on the data and then used to
transform it.

If you have separate training and test/validation datasets, it's crucial to use the same scaler
fitted on the training data to transform the test/validation data to maintain consistency.
Finally, the scaled data can be used as input for machine learning algorithms like regression,
classification, or clustering models to train and make predictions.

3.7.1 NORMALIZATION

Normalization is a feature scaling technique used to adjust the values of features to a
common scale, ensuring they have similar magnitudes. It helps prevent features with larger
magnitudes from dominating the learning process and ensures that all features contribute
equally to the model’s prediction. Additionally, normalization preserves the relative
relationships between feature values, maintaining the underlying structure and patterns in the
data. This preprocessing step is especially useful for algorithms that rely on distance metrics,
such as k-nearest neighbors (KNN) or clustering algorithms, where the distance between data
points is crucial for determining similarities or differences. The formula for normalization
divides each feature value by the square root of the sum of squares of all feature values in that
vector.

Normalized x_i = x_i / sqrt(x_1^2 + x_2^2 + ... + x_n^2)

3.7.2 STANDARDIZATION

Standardization, also known as Z-score normalization, is a feature scaling technique that
transforms features to have a mean of 0 and a standard deviation of 1. This technique centres
the feature values around zero and scales them based on the standard deviation, making it
suitable for algorithms that assume features are normally distributed and have a similar scale.
It is particularly useful when the features have different units or scales, as it brings them to a
comparable scale and helps algorithms converge faster and perform better. Moreover,
standardization does not change the shape or distribution of the original data, preserving the
inherent structure and patterns that might be present in the dataset.

Z-Score = (x - mean) / standard deviation
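Both transformations can be sketched in a few lines of NumPy (toy values, not the project data):

```python
import numpy as np

# Toy feature vector; both scaling transforms are shown side by side.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Standardization (z-score): result has mean 0 and standard deviation 1.
z = (x - x.mean()) / x.std()
print(round(float(abs(z.mean())), 10), round(float(z.std()), 10))  # 0.0 1.0

# Normalization to unit length: divide by the square root of the sum of squares.
unit = x / np.sqrt((x ** 2).sum())
print(round(float(np.linalg.norm(unit)), 10))  # 1.0
```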

3.8 DATA SPLITTING

The initial splitting converts the dataset into x and y, the features and the target variable; this
is done by allocating the column indices. After that, these two datasets are split into their
respective train and test sets.

The available dataset is divided into two or three subsets: training, validation, and test sets.
The training set is used to train the model, while the validation set is used to tune
hyperparameters and assess the model's performance during training. The test set remains
untouched during the entire training and validation process and is used only once to evaluate
the final model's performance. When splitting the data, it's crucial to maintain the same
distribution of classes or labels across all subsets, especially for imbalanced datasets. This
ensures that the model learns from a balanced representation of the data and can generalize
well to unseen examples. The process of splitting the data should be done randomly to ensure
that each subset is representative of the overall dataset and captures its underlying patterns.

Generally the train_test_split() function is used for data splitting, which splits the dataset into
train and test samples effortlessly. Using this function one can specify the size of the train or
test data. Besides the size, one can assign a random state so that the data is not chosen in
index order; moreover, it shuffles the data, which helps ensure the generalisability of the
model.

In this project the data is split into test and train by manually assigning the index of the data,
using the traditional iloc method. The last 70 rows are assigned for testing and the rest are
taken for training the models.
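A sketch of this manual split with iloc, on a toy frame:

```python
import pandas as pd

# Hypothetical time series frame. Slicing with iloc keeps temporal order:
# the last 70 rows become the test set, with no shuffling (unlike
# train_test_split, which would mix past and future observations).
df = pd.DataFrame({"conversion_rate": range(100)})

train = df.iloc[:-70]
test  = df.iloc[-70:]
print(len(train), len(test))  # 30 70
```

For time series, this ordered split is deliberate: shuffling would leak future information into the training data.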
CHAPTER 4
MODEL BUILDING
4.1 ARIMA MODEL

The ARIMA (AutoRegressive Integrated Moving Average) model combines three components.

AutoRegressive (AR): This component captures the relationship between an observation and
a number of lagged observations (i.e., its own past values). The order of the autoregression
(p) specifies how many lagged observations are included in the model.

Integrated (I): This component represents the differencing of the raw observations to achieve
stationarity. Stationarity is essential because many time series models, including ARIMA,
assume that the statistical properties of the series remain constant over time. The order of
differencing (d) specifies how many times differencing is performed to make the series
stationary.

Moving Average (MA): This component models the dependency between an observation and
a residual error from a moving average model applied to lagged observations. The order of
the moving average (q) specifies how many lagged forecast errors are included in the model.

The ARIMA model is denoted as ARIMA(p, d, q), where:

 p is the order of the autoregressive component,
 d is the degree of differencing,
 q is the order of the moving average component.

Stationarity: The first step in applying ARIMA is to ensure the time series is stationary.
Stationarity implies that the statistical properties of the series, such as mean and variance,
remain constant over time. If the series is non-stationary, differencing can be applied to
remove trend or seasonality until stationarity is achieved.

Identification: Once stationarity is achieved, the next step is to determine the appropriate
order (p, d, q) for the ARIMA model. This involves analyzing autocorrelation and partial
autocorrelation plots to identify the presence of autocorrelation in the series and determine
the orders of the AR and MA components.
Parameter Estimation: After identifying the orders of the ARIMA model, the next step is to
estimate the parameters of the model. This typically involves using methods like maximum
likelihood estimation to find the parameters that best fit the data.

Model Fitting: With the parameters estimated, the ARIMA model is then fitted to the data
using techniques such as the Kalman filter or exact maximum likelihood estimation.

Model Validation: Once the model is fitted to the data, it's important to validate its
performance. This can be done by assessing its accuracy on a holdout dataset or through
cross-validation techniques.

Forecasting: Finally, the validated ARIMA model can be used to make forecasts for future
periods.

The auto_arima function from the pmdarima library is used to automatically select the best
ARIMA model for the dataset. This function performs a stepwise search to minimize the
Akaike Information Criterion (AIC), a measure of the goodness of fit of a statistical model.
By minimizing the AIC, the function identifies the optimal combination of ARIMA
parameters (p, d, q) that best describes the underlying patterns in the data. The output of the
auto_arima function provides information about the best-fitting model, including its order
(p, d, q) and AIC value. In this case, the selected model is ARIMA(0,1,0), indicating that no
autoregressive or moving average terms are necessary, but differencing is needed to make the
series stationary. The summary also includes other diagnostic information, such as the log-
likelihood, AIC, BIC, and sample size.

Fig

The ARIMA model is then fitted to the training portion of the dataset using the ARIMA class
from the statsmodels.tsa.arima.model module. This involves specifying the order of the
ARIMA model as (0,1,0) and calling the fit method to estimate the model parameters. The
summary output of the fitted model provides additional diagnostic information, including the
log-likelihood, AIC, BIC, covariance type, coefficient standard errors, and hypothesis tests.
Forecasts for the test portion of the dataset are then produced using the predict method of the
fitted model, which takes the start and end indices of the forecast period as arguments and
returns the forecasted values.
Fig
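Since the selected order is ARIMA(0,1,0), a random walk, its point forecast simply carries the last observed value forward. A pure-Python sketch of that behaviour (not the statsmodels implementation):

```python
def arima_010_forecast(series, steps):
    """ARIMA(0,1,0): the first-differenced series is modelled as zero-mean
    noise, so every point forecast equals the last observed value."""
    last = series[-1]
    return [last] * steps

history = [0.51, 0.55, 0.53, 0.57]
print(arima_010_forecast(history, 3))  # [0.57, 0.57, 0.57]
```

This is why the ARIMA forecast appears as a flat line in the evaluation plots later in the report.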

4.2 EXPONENTIAL SMOOTHING MODEL

Exponential smoothing is a popular method used in time series analysis for forecasting future
data points based on past observations. It's particularly effective when the data exhibits a
trend or seasonality. Exponential smoothing is based on the idea of assigning exponentially
decreasing weights to past observations. The most recent observations are given more weight,
while older observations receive exponentially decreasing weights. This weighting scheme
allows the method to adapt quickly to changes in the underlying pattern of the time series
data.

Single exponential smoothing is suitable for time series data without a trend or seasonality. If
the data exhibits a trend, double exponential smoothing, also known as Holt's method, is
more appropriate: in addition to smoothing the level of the series (like single exponential
smoothing), Holt's method also smooths the trend component, and its forecast is built from
two equations. If the data also shows seasonality, triple exponential smoothing (the Holt-
Winters method) adds a third equation for the seasonal component.

The Holt-Winters method is applied for time series forecasting using the
ExponentialSmoothing class from the statsmodels.tsa.holtwinters module, which contains the
implementation of the method.

train['conversion_rate'] is passed as the time series data to be used for model fitting;
seasonal='add' specifies that the model should consider additive seasonality, and
seasonal_periods=12 indicates that the seasonal component repeats every 12 periods. The
fit() method is called on the model_hw object to fit the Holt-Winters model to the training
data. This step estimates the model parameters, including the level, trend, and seasonal
components, using the specified (additive) method.

After the model is fitted, the forecast() method is used to generate forecasts for future time
periods; steps=70 specifies the number of periods into the future for which forecasts are
generated, so forecasts for 70 periods ahead are computed.

Fig
CHAPTER 5
EVALUATION

5.1 EVALUATION OF ARIMA MODEL

1. Mean Absolute Error (MAE): MAE measures the average magnitude of errors between
predicted and actual values.
Interpretation: On average, the ARIMA model's predictions are off by approximately 0.0428
units from the actual values.

2. Mean Squared Error (MSE): MSE measures the average of the squares of the errors
between predicted and actual values.
Interpretation: The ARIMA model's predictions have a squared error of approximately
0.00204 units on average.

3. Root Mean Squared Error (RMSE): RMSE is the square root of the average of the squared
errors, providing a measure of the spread of errors.
Interpretation: The ARIMA model's predictions deviate from the actual values by
approximately 0.0452 units on average.

4. Mean Absolute Percentage Error (MAPE): MAPE expresses errors as a percentage of the
actual values, providing insight into the relative accuracy of predictions.
Interpretation: The ARIMA model's predictions have an average percentage error of
approximately 8.20%.

Overall, these evaluation metrics provide a comprehensive view of the ARIMA model's
performance. Lower values indicate better predictive accuracy, so lower MAE, MSE, RMSE,
and MAPE values suggest that the ARIMA model is more effective in forecasting the data.
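For reference, the four metrics can be computed by hand on toy actual/predicted values:

```python
import math

# Toy values, not the report's forecasts.
actual    = [0.50, 0.60, 0.40, 0.50]
predicted = [0.45, 0.65, 0.45, 0.45]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]
mae  = sum(abs(e) for e in errors) / n                            # mean |error|
mse  = sum(e * e for e in errors) / n                             # mean squared error
rmse = math.sqrt(mse)                                             # same units as data
mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n * 100

print(round(mae, 4), round(mse, 4), round(rmse, 4), round(mape, 2))  # 0.05 0.0025 0.05 10.21
```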

For further evaluation of the model, the forecasted values are plotted against the actual
values. The forecast is depicted as a red line and the actual values as a blue line.

The plt.plot() function is used to plot the actual conversion rates from the training data
(train['conversion_rate']) against their corresponding dates (train.index). These actual values
are represented by blue lines. Similarly, another plt.plot() function is used to plot the
forecasted conversion rates (pred) against the dates of the test data (test.index). These
forecasted values are represented by red lines.

The plt.fill_between() function is used to plot the confidence intervals around the forecasted
values. These intervals are determined by adding and subtracting 1.96 times the standard
deviation of the residuals from the forecasted values. The confidence intervals are shaded in
pink with an alpha value of 0.3 to make them semi-transparent.

Fig
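A sketch of such a plot with synthetic stand-in values; the Agg backend is used so it renders off-screen:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the training series and the forecast.
train_idx, test_idx = np.arange(30), np.arange(30, 40)
train = 0.5 + 0.01 * np.random.default_rng(1).standard_normal(30)
pred = np.full(10, train[-1])
resid_std = train.std()

fig, ax = plt.subplots()
ax.plot(train_idx, train, color="blue", label="actual")
ax.plot(test_idx, pred, color="red", label="forecast")
# Approximate 95% band: forecast +/- 1.96 standard deviations of residuals.
ax.fill_between(test_idx, pred - 1.96 * resid_std, pred + 1.96 * resid_std,
                color="pink", alpha=0.3)
ax.legend()
fig.canvas.draw()
print(len(ax.lines))  # 2 (actual and forecast lines)
```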

5.2 EVALUATION OF EXPONENTIAL SMOOTHING MODEL

1. Mean Absolute Error (MAE): The MAE measures the average magnitude of errors
between the predicted and actual values. It is calculated by taking the average of the
absolute differences between predicted and actual values. In this context, the MAE of
approximately 0.0482 suggests that, on average, the exponential smoothing model's
predictions deviate from the actual values by around 0.0482 units of conversion rate.
2. Mean Squared Error (MSE): The MSE measures the average squared differences
between predicted and actual values. A lower MSE indicates better predictive
accuracy. The MSE of approximately 0.00255 indicates that, on average, the squared
differences between predicted and actual values are around 0.00255 units.
3. Root Mean Squared Error (RMSE): The RMSE is the square root of the MSE and
provides a measure of the typical deviation of the predicted values from the actual
values. It is easier to interpret than MSE because it is in the same units as the original
data. The RMSE of approximately 0.0505 suggests that, on average, the exponential
smoothing model's predictions deviate from the actual values by around 0.0505 units
of conversion rate.
4. Mean Absolute Percentage Error (MAPE): The MAPE measures the average
percentage difference between predicted and actual values, relative to the actual
values. The MAPE of approximately 9.23% suggests that, on average, the exponential
smoothing model's predictions deviate from the actual values by around 9.23% of the
actual conversion rate.

These evaluation metrics offer insights into the accuracy and performance of the exponential
smoothing model in forecasting conversion rates. Lower values of MAE, MSE, RMSE, and
MAPE indicate better predictive performance, while higher values suggest poorer
performance. By considering these metrics, stakeholders can evaluate the reliability of the
exponential smoothing model's predictions and make informed decisions based on the
forecasted conversion rates.
Fig

The red line depicts the forecast of the exponential smoothing method, which has the
smallest difference from the test sample.
CHAPTER 6
CONCLUSIONS
6.1 CHOOSING THE BEST MODEL

In this evaluation, we compare the performance of the ARIMA and Exponential Smoothing
models using four commonly used evaluation metrics: Mean Absolute Error (MAE), Mean
Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage
Error (MAPE).

1. Mean Absolute Error (MAE): It represents the average of the absolute errors between the
actual and predicted values. A lower MAE indicates better performance

2. Mean Squared Error (MSE): It calculates the average of the squares of the errors between
the actual and predicted values. Like MAE, lower MSE values indicate better performance.

3. Root Mean Squared Error (RMSE): It is the square root of the MSE and represents the
average magnitude of the errors. RMSE is in the same units as the original data, making it
easier to interpret.

4. Mean Absolute Percentage Error (MAPE): It measures the average percentage difference
between the actual and predicted values. MAPE is expressed as a percentage, and lower
values indicate better accuracy.

From the evaluation results:

- For the ARIMA model:

- MAE: 0.0428

- MSE: 0.00204

- RMSE: 0.0452

- MAPE: 8.20%

- For the Exponential Smoothing model:

- MAE: 0.0482

- MSE: 0.00255
- RMSE: 0.0505

- MAPE: 9.23%

Based on the RMSE and MAPE metrics, the ARIMA model outperforms the Exponential
Smoothing model. The ARIMA model has lower values for both RMSE and MAPE,
indicating that it provides more accurate predictions compared to the Exponential Smoothing
model. Therefore, we can conclude that the ARIMA model is the better choice for forecasting
conversion rates in this scenario.

But as the ARIMA model predicts the future at a constant rate, it cannot be used to assist a
marketing department, so the exponential smoothing method is preferred over the ARIMA
model.

Fig

Here we can see the actual values depicted in blue. The ARIMA forecast is constant at a 0.57
conversion rate, while the smoothing model predicts values close to the actual range of 0.54
to 0.57. This small difference is acceptable, unlike a constant rate, so the exponential
smoothing method is selected as the best model.
