TFM Widad El Abbassi
Universidad Politécnica de Madrid
Escuela Técnica Superior de
Ingenieros Informáticos
July 2020
Supervisor:
Marta Patiño-Martínez
LENGUAJES Y SISTEMAS INFORMÁTICOS E INGENIERÍA DE SOFTWARE
ETSI Informáticos
Universidad Politécnica de Madrid
Abstract
Stream processing technologies are becoming increasingly popular in the retail industry, both in physical stores and in e-commerce. Retailers and mall managers compete intensely to provide the solutions that best meet customer expectations. This can be achieved by studying customer shopping behavior inside shopping centers and stores, which provides information about the pattern of shopping activities and the movements of consumers: we can track the most commonly used routes inside the mall, identify when occupancy peaks, and determine which brands or products attract shoppers more than others (the shopper profile). The gathered data will then allow us to make more effective decisions about the store layout, product positioning, marketing, traffic control and more.
The purpose of this project is to examine consumer behavior inside shopping centers and stores. In particular, we want to generate insights for two types of analytics: Mall Analytics, by measuring foot traffic, in-mall proximity traffic and location marketing, with the intention of helping mall managers improve security and advertising; and In-Store Analytics, to help retailers identify underperforming product categories, compare sales potential and improve inventory management. However, the real challenge does not only lie in storing and managing this huge amount of data but also in accessing the results and providing reports in real time. With this in mind, our work proposes a real-time data processing architecture able to ingest, analyze and generate visualization reports almost immediately. In detail, the first component in the proposed pipeline is the Kafka Connect framework, which will be responsible for generating continuous flows of sensor and POS (point of sale) data. These flows will then be sent to the second component, Apache Kafka, a distributed messaging system that will store the incoming messages in multiple Kafka topics (for instance, sensor1 in zone1, area1 of the mall will be stored in a particular topic1). The third component in this architecture will be the processing unit, Apache Flink, a streaming dataflow engine and scalable data analytics framework that delivers analytics in real time; one of its most interesting features is the use of event timestamps to build time windows for computations. Several Flink queries will be developed to measure the pre-defined metrics (Mall Foot Traffic, Location Marketing, and so on). The fourth component will be a real-time search and analytics engine, Elasticsearch, in which the results of the previous queries will be stored in indexes and then used by the final component, Kibana, a powerful visualization tool that delivers insights and dynamic visualization reports.
Our work consists of implementing this streaming analytics pipeline, which will help mall managers and retailers investigate the whole shopping process and thus design more effective development plans and marketing strategies.
Contents
Chapter 1 Introduction
1.1 Motivation
1.2 Goal
1.3 Thesis organization
Chapter 2 Background
1.4 Big Data
1.5 The importance of stream processing
1.6 Stream Processing frameworks
1.6.1 Apache Spark
1.6.2 Apache Storm
1.6.3 Apache Flink
1.6.4 Apache Samza
1.6.5 From Lambda to Kappa Architecture
1.6.5.1 Lambda Architecture
1.6.5.2 Kappa Architecture
Chapter 3 System Architecture
1.7 Kafka
1.7.1 Kafka Connect
1.7.1.1 Apache Avro
1.7.1.2 Schema Registry
1.8 Flink
1.9 Elasticsearch
1.10 Kibana
1.11 Data processing architecture
Chapter 4 Implementation
1.12 Environment Setup
1.13 Data Generation
1.13.1 Mall Sensor Data
1.13.2 Purchase Data
1.14 Data Processing
1.14.1 Study Case 1: Mall Analytics
1.14.1.1 Mall Foot Traffic
1.14.1.2 In-Mall Proximity Traffic
1.14.1.3 Location Marketing
1.14.2 Study Case 2: In-Store Optimization
1.14.2.1 Define Under-performing Categories
1.14.2.2 Customer Payment Preference
1.14.2.3 Inventory Checking
1.14.2.4 Compare sales potential
1.15 Data Visualization and Distribution
Chapter 5 Conclusion
3 Annexes
Annex 1
Annex 2
List of figures
Figure 1: 5Vs of Big Data
Figure 8: Confluent Schema Registry for storing and retrieving schemas
Chapter 1 Introduction
1.1 Motivation
The retail market has changed dramatically in the last few years: in addition to the physical retailers that struggle every day, online retailers have entered the scene, together with new brands trying to earn their place in the market. Taken together, this has pushed retailers to adopt new store format strategies in order to meet customer expectations and industry competition. They have therefore started using new technologies such as BLE beacons, Wi-Fi tracking, and POS (point of sale) systems to track shopper behavior through location, time and activity. This continuous flow of sensor readings creates a wealth of data that can be processed by a streaming framework such as Flink in order to get quick insights, making it possible to quantify the customer's journey and purchase pattern. In the end, retailers will be able to detect, identify, and track where people are (location), how long they stay (time), and what they do (activities), which will help them understand the store's sales opportunity, optimize their product positioning, and adopt location marketing to notify their customers about the offers and discounts that match their interests.
1.2 Goal
The goal of this project is to design and implement a retail analytics solution that helps retailers get more value from their existing resources, improve their in-store and mall analytics metrics, and therefore increase the business benefit.
We start by determining the set of problems we aim to solve; for instance, we want to track the visitor patterns inside the mall to control the traffic. We then need to think about the type of data that will allow us to measure this metric, which fields we need to define for generating the data, and how we must produce the records. To do this, we have chosen the Kafka Connect framework as the first component of our architecture; its Datagen connector will produce data according to the Avro schemas we will define and store in the Confluent Schema Registry. As a result we will have two continuous streams of data, one representing the mall sensors and the other reflecting the data sent by the point-of-sale systems.
Now that we have constructed the data, we will choose which stream processing engine to deploy. Many frameworks could be used to build the queries for this type of scenario; our choice was Apache Flink, because its event timestamps help set time windows for computations, along with its other features. Flink programs will be developed within two scenarios: first, we will program a set of metrics for Mall Analytics, which are respectively Mall Foot Traffic, In-Mall Proximity Traffic and Location Marketing. Secondly, we will program key performance indicators for In-Store Optimization, which will help define the underperforming products inside the store, detect customer preferences, construct sales reports and, finally, optimize the inventory management.
Next, we will forward the query results to Elasticsearch indexes, where the data can be analyzed and explored by Kibana, the visualization tool intended to create dynamic dashboards and deliver quick insights about the retail operations, consumer demand and sales. To give an illustration of this, consider the case where we want to quantify the ability of a store to convert demand into revenue, also known as the sales conversion. To do so, we will first need to gather the data from the POS as well as from the store sensors; secondly, we will create a Flink query that calculates the ratio of transactions (buyers) to visitors, over a window of one day for example. Each day the result will be stored in Elasticsearch, and each month the dashboard will be updated in Kibana with the changes and transitions seen in the sales conversion. This way we will be able to deliver business conclusions and propose optimizations for the current strategies.
1.3 Thesis organization
Chapter 4 represents the core of this thesis; it is where we introduce the different steps followed to set up the work environment and design the retail application, from generating the data to visualizing the results.
In Chapter 5, we summarize the thesis and point out future work that can extend this project.
Chapter 2 Background
To provide a better understanding of this project, this chapter highlights the
basic concepts of big data technologies, and more specifically the importance of
streaming analytics in today’s world.
1.4 Big Data
There are several definitions that compete to clarify the term "Big Data"; here we have chosen one of the most credible:
"Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions." (Oxford Dictionary)
Nevertheless, most definitions describe Big Data based on the 5V principle, which captures the various challenges encountered when designing algorithms or software systems capable of processing this data.
- Volume: Due to the data explosion caused by digital and social media, data is rapidly being produced in very large chunks, and the conventional methods of business intelligence and analytics cannot keep up with storing and processing it. Big data volume defines the amount of data that is generated; the value of data is also dependent on its size.
- Variety: Data sources in big data may involve external sources as well as internal business units. Generally, big data is classified as structured, semi-structured and unstructured data. In fact, almost 80 percent of the data produced globally, including photos, videos, mobile data and social media content, is unstructured in nature.
- Veracity: Veracity in Big Data is related to Big Data security and revolves around two aspects: data consistency (or certainty), which can be defined by its statistical reliability; and data trustworthiness, which is defined by a number of factors including the data origin, the collection and processing methods, and the use of a trusted infrastructure and facility. Big Data veracity ensures that the data used is trusted, authentic and protected from unauthorized access and modification.
- Value: In the context of big data, value amounts to how worthy the data is of positively impacting a company's business. What you do with the collected data is what matters. With the help of advanced data analytics, useful insights can be derived from the collected data; these insights, in turn, are what add value to the decision-making process.
1.5 The importance of stream processing
Stream processing is about making the right data available at the right time; this data might be coming as sensor events, user activity on a website, financial trades, and so on.
The need for stream processing technologies comes from many reasons:
- Batch processing is not the optimal solution when it comes to dealing with a never-ending stream of events: it requires stopping the data collection each time we want to store and process it; afterwards, we have to handle the next batch and then worry about aggregating across multiple batches. Stream processing, on the other hand, handles never-ending data streams naturally and easily. It can detect patterns, inspect results, look at multiple levels of focus, and also look at data from multiple streams simultaneously.
- Stream processing handles data streams as they come in, hence it spreads the processing over time instead of letting the data build up and then processing it all at once, as batch processing does.
1.6 Stream Processing frameworks
1.6.1 Apache Spark
Apache Spark is a widely used, highly flexible engine for batch-mode and stream data processing that is well developed for scalable performance at high volumes. To maximize the performance of Big Data analytics applications, Spark's in-memory data processing engine conducts analytics, ETL, machine learning and graph processing on data whether in motion or at rest.
The Apache Spark architecture is founded on Resilient Distributed Datasets (RDDs). These are distributed, immutable collections of data, which are split up and allocated to workers. The worker executors process the data. Since the RDD is immutable, the worker nodes cannot make alterations; they process the information and output results.
A Spark application takes data from a collection of sources (HDFS, NoSQL and relational databases, etc.), applies a set of transformations on them, and then executes an action that generates meaningful results.
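As a minimal illustration of this source, transformation and action model (a sketch only, not part of the thesis pipeline, and assuming a local Spark installation with its Java API), a tiny job could look as follows:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class SparkSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // source: an in-memory collection stands in for HDFS or a database
            JavaRDD<Integer> amounts = sc.parallelize(Arrays.asList(12, 40, 7, 55));
            // transformation: RDDs are immutable, so filter() returns a new RDD
            JavaRDD<Integer> large = amounts.filter(a -> a > 10);
            // action: triggers execution and brings a result back to the driver
            long count = large.count();
            System.out.println("purchases over 10: " + count);
        }
    }
}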
1.6.2 Apache Storm
Introduced by Twitter, Apache Storm is one of the most popular and widely adopted open-source distributed real-time frameworks. It is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size with very low latency, which makes Storm suitable for near-real-time processing workloads. It processes large quantities of data and provides results with lower latency than most other available solutions.
The Apache Storm architecture is founded on spouts and bolts. Spouts are sources of information and transfer it to one or more bolts; these are in turn linked to other bolts, and the entire topology forms a DAG. Developers define how the spouts and bolts are connected.
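To make that wiring concrete, the following sketch (an illustrative example written against the Storm 2.x Java API, with a hypothetical VisitSpout and PrinterBolt that are not part of this thesis) builds a small topology and runs it on an in-process LocalCluster:

import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class StormSketch {

    // Spout: emits a synthetic "visit" tuple every 100 ms
    public static class VisitSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        @Override
        public void open(Map<String, Object> conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("zone1"));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("zone"));
        }
    }

    // Bolt: consumes the spout's tuples and prints them
    public static class PrinterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println("visit in " + input.getStringByField("zone"));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt, nothing to declare
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("visits", new VisitSpout());
        builder.setBolt("printer", new PrinterBolt()).shuffleGrouping("visits");
        try (LocalCluster cluster = new LocalCluster()) {   // in-process cluster, for testing only
            cluster.submitTopology("sketch", new Config(), builder.createTopology());
            Thread.sleep(5000);
        }
    }
}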
1.6.3 Apache Flink
Apache Flink is a distributed stream processor with expressive APIs for implementing stateful stream processing applications; since it is the engine chosen for this project, it is described in detail in Section 1.8.
1.6.4 Apache Samza
Apache Samza is a distributed stream processing framework whose durable, Kafka-based messaging layer ensures that no message is lost. It is also scalable, as it is partitioned and distributed at all levels.
The main goals of Apache Samza are having better fault tolerance, processor
isolation, security, and resource management.
1.6.5 From Lambda to Kappa Architecture
With the increasing volume of data and the need to analyze and obtain value from the generated data as soon as possible, new architectures have to be defined to cover use cases different from the existing ones. The architectures most commonly used by companies are mainly two: the Lambda Architecture and the Kappa Architecture. The main difference between both will be explained in the following sections.
1.6.5.1 Lambda Architecture
The data stream entering a Lambda system is dual-fed into both a batch and a speed layer, as shown in figure 6.
Figure 6 : Lambda Architecture
1.6.5.2 Kappa Architecture
The Kappa Architecture treats everything as a stream and processes the data as it arrives; in other words, this architecture can be used in cases where the only thing that really matters is analyzing the incoming data as soon as it comes in. Its main principles are:
- The starting data is not modified: the data is stored without being transformed, and the views are derived from it. A specific state can be recalculated, since the source information is not modified.
- There is only one processing flow: since we maintain a single flow, the code, the maintenance and the system updates are considerably reduced.
Chapter 3 System Architecture
This chapter describes the main components used to build the work environment. It briefly discusses the function and features provided by each of the frameworks, both in general and as used in this project.
The goal behind the developed architecture is to help retailers and mall managers take immediate, process-based action on the discovered insights, so they can plan the right changes to the stores at the right time.
The set of problems we aim to solve with this architecture can be divided into two big sections. The first use case is Mall Analytics, where it is important to keep track of the customers inside the shopping center to identify foot-traffic trends and improve security and advertising. The second use case is In-Store Optimization, where information such as the customer's payment method, the products bought during the day and those that have been returned can help retailers define a customer's profile and therefore come up with more personalized offers. Furthermore, retailers also need solutions to optimize their inventory management, conduct deeper analysis of their sales potential, and thus improve their real-time decisions.
1.7 Kafka
Apache Kafka is an open-source platform whose main function is the centralization of the data flows coming from the various systems inside a company. This platform was born as a distributed messaging system in publish-subscribe mode, capable of supporting very high data rates and of ensuring the persistence of the data it receives. The data received by Apache Kafka is kept within topics, and each topic corresponds to a category of data. The systems that publish data to Kafka topics are producers, while the systems that read data from topics are consumers.
In this thesis, Kafka is used to deliver stream events to Apache Flink without transforming them: it receives the messages from the data sources, stores them within topics, and then serves them to Flink for processing. Furthermore, it can also be used in the opposite direction, receiving the results from Flink and putting them into topics so that they can be pulled by the mall application.
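As a small, hedged illustration of the publish side (not taken from the thesis code; the topic name "1" and the localhost:9093 listener are assumptions based on the docker-compose file in Annex 1), a Java producer could publish a message as follows:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9093");   // the broker's OUTSIDE listener
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // publish one message to the topic that holds Zone1 sensor events
            producer.send(new ProducerRecord<>("1", "sensor-reading"));
        }
    }
}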
1.7.1 Kafka Connect
Kafka Connect is the framework used to stream data into Kafka through ready-made connectors; in this project its Datagen connector generates the test data. The connector's predefined quickstart schemas do not correspond to the type of events we would like to generate, which is why we defined our own schema specification using Apache Avro.
Apache Avro is a data serialization system used here to define the data schema for each record's value. This schema describes the fields allowed in a given value, along with their data types, using a JSON format.
In this project, we used Apache Avro to customize the fields and their values to match the application requirement of having sensor data and purchase data.
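For illustration only, the sketch below builds a record against a cut-down, two-field version of the sensor schema using Avro's generic API; the complete schema with all its fields is the one listed in Annex 2:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroSketch {
    public static void main(String[] args) {
        // a cut-down, two-field version of the sensor schema (the real one is in Annex 2)
        String schemaJson = "{\"type\":\"record\",\"name\":\"SensorReading\",\"fields\":["
                + "{\"name\":\"sensor_UID\",\"type\":\"string\"},"
                + "{\"name\":\"Timestamp\",\"type\":\"long\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // a record must match the fields and types declared by the schema
        GenericRecord reading = new GenericData.Record(schema);
        reading.put("sensor_UID", "e74e8480-42da-46c1-b197-584e6030120c");
        reading.put("Timestamp", 1583024400L);
        System.out.println(reading);
    }
}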
1.7.1.2 Schema Registry
Confluent Schema Registry provides a serving layer for schema metadata. It provides a RESTful interface for storing and retrieving Avro schemas, and it keeps the versions of our schemas together with their compatibility settings, which the Avro serializers rely on.
The Schema Registry lives outside of, and separately from, the Kafka brokers. Producers and consumers still talk to Kafka to publish and read data (messages) to and from topics. Concurrently, they can also talk to the Schema Registry to send and retrieve the schemas that describe the data models of those messages. Likewise, in our architecture Flink will synchronize with the Confluent Schema Registry to retrieve the sensor and POS data schemas (see Figure 8).
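As a hedged example of that RESTful interface (assuming the registry's default subject naming of topic-value and the http://schema-registry:8081 address used throughout this project), the latest schema registered for the value of topic "1" could be fetched in Java like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SchemaRegistrySketch {
    public static void main(String[] args) throws Exception {
        // latest schema registered for the value of topic "1"
        // (the registry's default subject naming strategy is <topic>-value)
        URL url = new URL("http://schema-registry:8081/subjects/1-value/versions/latest");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = rd.readLine()) != null) {
                System.out.println(line);   // JSON with subject, version, id and the Avro schema
            }
        }
    }
}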
1.8 Flink
Apache Flink is a distributed stream processor with intuitive and expressive APIs to implement stateful stream processing applications. It efficiently runs such applications at large scale in a fault-tolerant manner. Flink on its own is the runtime core engine that does the stream processing. On top of the engine, Flink has its abstraction APIs: the DataSet API to consume and process batch data sources, and the DataStream API to consume and process real-time streaming data. These two APIs are programming abstractions and the foundation for user programs.
Flink stands out thanks to many features, for instance: its event-time and processing-time semantics, exactly-once state consistency guarantees, layered APIs with varying tradeoffs between expressiveness and ease of use, and millisecond latencies while processing millions of events per second, in addition to many other features.
In this thesis, Flink is deployed as the stream processing framework of choice for our streaming retail application: it will continuously consume the streams of events provided by Kafka, apply transformations on them and then update the results in the search indexes. Finally, its event timestamps can be used to build time windows for the computations.
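To give a feel for the DataStream API before the real queries of Chapter 4, here is a minimal, self-contained sketch in which an in-memory source stands in for the Kafka topics; nothing in it is thesis code:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // a bounded in-memory stream stands in for the Kafka-backed sensor streams
        DataStream<String> zones = env.fromElements("zone1", "zone2", "zone1");

        // a stateless transformation expressed on the DataStream API
        DataStream<String> visits = zones.map(new MapFunction<String, String>() {
            @Override
            public String map(String zone) {
                return "visit@" + zone;
            }
        });

        visits.print();
        env.execute("flink-sketch");
    }
}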
1.9 Elasticsearch
Figure 11 : Elasticsearch for fast streaming
Elasticsearch is a distributed, real-time search and analytics engine in which data is stored in indexes that can be queried with very low latency. In our pipeline, Apache Flink executes stream analysis jobs on the sensor and POS data, applies transformations to analyze, transform and model the data in motion, and finally writes the results to an Elasticsearch index. Kibana connects to the index and queries it for data to visualize.
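As a small sketch of what writing one such result document looks like outside of Flink (assuming the 7.x high-level REST client that matches the Elasticsearch 7.7.1 image of Annex 1, and reusing the flink-streams index name only as an example):

import java.util.HashMap;
import java.util.Map;

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class EsIndexSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // one result document, as a Flink query might produce it
            Map<String, Object> doc = new HashMap<>();
            doc.put("zone", "zone1");
            doc.put("visitors", 42);
            // store it in the index that Kibana will later query
            client.index(new IndexRequest("flink-streams").source(doc), RequestOptions.DEFAULT);
        }
    }
}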
1.10 Kibana
Kibana is an open-source data visualization and exploration tool used for log
and time-series analytics, application monitoring, and operational intelligence
use cases. It offers powerful and easy-to-use features such as histograms, line
graphs, pie charts, heat maps, and built-in geospatial support.
Furthermore, we can set a time filter that displays only the data within a specified time range for time-based events.
Here, Kibana will be integrated with Elasticsearch to create data visualization dashboards from the data stored in the indexes. This analytical platform will help generate insights about a company's business operations.
1.11 Data processing architecture
In this project we opt for a very common approach for ingesting and analyzing huge volumes of streaming data, known as the Kappa architecture. It is based on a streaming design in which an incoming series of data is first stored in a messaging engine like Apache Kafka. From there, a stream processing engine reads the data, transforms it into an analyzable format, and then stores it somewhere for end users to query.
It has four main principles: data is immutable, everything is a stream, a single stream engine is used, and data can be replayed. It is implemented as follows:
Chapter 4 Implementation
Despite the fact that retailers have been using data analytics to develop business intelligence for years, the extreme complexity of today's data requires new approaches and tools. This is because the retail industry has entered the big data era and therefore has access to a huge amount of information that can be used to optimize the shopping experience and forge tighter relationships between customers, brands, and retailers.
This chapter is devoted to describing the steps followed to build the working environment, generate the test data, create the Flink queries and provide business insights to support retail decision making.
1.12 Environment Setup
In this project, our pipeline is divided into three main blocks: data generation, data processing and data visualization, which means we need to take into consideration the resource allocation that each service will need. The best choice here is to opt for a container-based technology such as Docker, which supports multiple workloads while using the same OS. To set up our work environment, we will use docker-compose to define and run the multi-container services; our environment is therefore defined as follows:
Inside this Docker Compose file, we need to define the properties needed for each service, such as the port number, the links with other containers, and the environment variables. The rest of the file is placed in Annex 1.
In this cluster, we will work with only one broker and one ZooKeeper node, given the limited resources we have; the same goes for the Flink task managers. Despite that, we will try to implement some Flink fault-tolerance mechanisms, such as state and checkpoints, which enable programs to recover in case of failures.
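As a hedged sketch of what enabling that mechanism looks like in a Flink program (the interval and mode below are chosen arbitrarily for illustration and are not taken from the thesis code):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // snapshot all operator state every 10 seconds so the job can be restored
        // from the last successful checkpoint after a failure
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);
        env.fromElements(1, 2, 3).print();   // trivial pipeline, just so the job runs
        env.execute("checkpointed-sketch");
    }
}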
To develop the Flink queries we will use IntelliJ IDEA along with the Maven build tool to build and manage the project. We also need to populate the POM file with the required dependencies for Flink, Kafka and Elasticsearch.
After setting up the cluster environment on Docker, we need to make sure that
the containers are defined inside the same docker network, in order to
guarantee the communication among containers.
1.13 Data Generation
The first step in designing any end-to-end pipeline is to make sure we have a plentiful amount of data to test the different possible scenarios of the desired application. There are many data generation tools available on the Internet that allow us to create datasets in a few clicks and then export them to a number of formats, including CSV, JSON, XML and even SQL. The problem with this type of approach is not only the fact that it does not work well for producing records with complex data types (e.g. records with multiple fields, or randomized data that maintains an order) but also that it is not very realistic. With this in mind, we need an approach that allows us to generate data in real time, which is why we chose the Kafka Connect Datagen connector: it produces realistic, customized data using our own schema specifications with our own fields.
To define the data schema for the record values, we will work with the Apache Avro format to describe the fields, the interval at which data is produced and the number of iterations. In view of this, we will define two Avro schemas: the first will be responsible for generating data for multiple mall sensors (BLE, Wi-Fi location tracking, etc.), and the second will produce purchase data that would normally be sent by the POS (point of sale) systems of the different shops.
1.13.1 Mall Sensor Data
First, we need to define the schema that will be responsible for generating sensor data for multiple zones inside the mall: Food Court, Fashion, Home Decoration and Electronics. The fields are defined as follows:
Sensor UID: UID of the sensor sending the customer's data.
MAC address: of the customer's phone detected by the sensor.
IP address: of the customer, for instance if he used the Wi-Fi of the mall or of any shop.
Longitude, Latitude: to locate the customer inside the mall; the coordinate values correspond to El Corte Inglés in Plaza de Callao, Madrid.
Timestamp: of the sensor reading.
The entire schema file is placed in Annex 2 and needs to be moved inside the kafka-connect container in order to be accessible to the Datagen connector.
Secondly, we need to specify the connector parameters, as shown in figure 16.
Figure 16 : Configuration of Datagen Connector
The same schema will be used to produce four streams of data, Zone1, Zone2, Zone3 and Zone4, which will result in four Datagen connectors.
After starting the four connectors, we can verify the incoming messages in Kafka:
1.13.2 Purchase Data
For generating purchase data, the schema will be defined by the following eight fields (Annex 2):
Seller Name: name of the worker who completed the purchase.
Product ID: reference of the sold product inside the store.
Product Category: one of the store categories: Accessories, Shoes & Bags, Women, Kids, Men and Home Care.
Items Available in stock: number of items left in stock for a specific product.
Payment method: the typical payment methods used are Cash, Mobile Payment and Credit Card.
Amount: price paid during the purchase.
Loyalty Card: specifies whether the customer belongs to a loyalty program.
Timestamp: of the purchase, provided by the POS.
For this schema, we will settle for only two streams, coming from POS1 and POS2. The same connector configuration applied for the sensor data will be used for these two connectors, modifying only the file path, the Kafka topic and the number of iterations, which is reduced to 100.
In Kafka, we can see the incoming messages:
Now that we have our real-time data at hand, we can move forward in the project and start developing our Flink queries.
1.14 Data Processing
The metrics built on top of this data should help increase the sales opportunity and measure the performance of our mall and its stores; therefore, in this project we selected the following core metrics for evaluating the mall and store performance:
Case study 1: Mall Analytics
- Mall Foot Traffic.
- In-Mall Proximity Traffic.
- Location Marketing.
Case study 2: In-Store Optimization
- Define Under-performing Categories.
- Customer Payment Preference.
- Inventory Checking.
- Compare sales potential.
Before jumping into the Flink code, it is important to note that the structure of each query will be as follows:
– Read the stream of events from a Kafka topic.
– Perform transformations on the incoming messages.
– Write the results of the operations to Elasticsearch.
A compact sketch of this three-step structure is given right below; the actual queries are then presented one by one in the following sections.
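The sketch compresses the read, transform and write steps into one self-contained program; it is an illustration under assumptions (plain string messages instead of the Avro records, and host, topic and index names borrowed from the docker-compose file of Annex 1), not the thesis code itself:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.http.HttpHost;
import org.elasticsearch.client.Requests;

public class PipelineSkeleton {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 1. read the stream of events from a Kafka topic (plain strings here, Avro in the real queries)
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "pipeline-skeleton");
        DataStream<String> events =
                env.addSource(new FlinkKafkaConsumer<>("1", new SimpleStringSchema(), props));

        // 2. perform a transformation on the incoming messages
        DataStream<String> transformed = events.map(value -> "processed:" + value);

        // 3. write the results of the operations to Elasticsearch
        ElasticsearchSink.Builder<String> esSinkBuilder = new ElasticsearchSink.Builder<>(
                Collections.singletonList(new HttpHost("elasticsearch", 9200, "http")),
                new ElasticsearchSinkFunction<String>() {
                    @Override
                    public void process(String element, RuntimeContext ctx, RequestIndexer indexer) {
                        Map<String, Object> json = new HashMap<>();
                        json.put("data", element);
                        indexer.add(Requests.indexRequest().index("flink-streams").source(json));
                    }
                });
        esSinkBuilder.setBulkFlushMaxActions(1);
        transformed.addSink(esSinkBuilder.build());

        env.execute("pipeline-skeleton");
    }
}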
1.14.1 Study Case 1: Mall Analytics
1.14.1.1 Mall Foot Traffic
Mall Foot Traffic is a key indicator for both shopping center owners and stores: it gives an idea of the sales opportunity represented by all visitors, and it is calculated as the sum of all incoming shoppers over a period of time. To translate this into a Flink program, we proceed as follows:
– We start the program by reading the required Kafka parameters props that the Flink consumer will use to pull messages from the topic "1" in Kafka. We also need to pass the deserialization schema, as well as the path to the Schema Registry "http://schema-registry:8081", to the FlinkKafkaConsumer<>, since we have generated the events using our own customized Avro schema that resides in the Confluent Schema Registry.
– In the same way, we will create four Flink DataStreams out of four Kafka topics, respectively for Zone1, Zone2, Zone3 and Zone4.
– Likewise, we will define a map function that takes the sensor data as input and returns a Tuple3<String, Long, Integer> (value.getSensorUID(), value.getTimestamp(), 1), where the 1 at the end will act as a counter. The mapped stream will then be continuously updated with new tuples as soon as new records are read.
– Next, we need to perform a count aggregation for the mall traffic, and an efficient way to do it is to use a ReduceFunction that sums the ones provided by the mapped streams.
– Now that we have the count of visitors for each zone, we will move forward and join the resulting streams in order to produce a joinedStream that returns the mall traffic for the whole shopping center.
– Finally, we will create an Elasticsearch sink that will be responsible for sending the results to an Elasticsearch index flink-streams, which in turn will be used by Kibana to deliver quick insights.
The corresponding Flink code is shown below:
// (excerpt: the StreamExecutionEnvironment `env`, the ParameterTool `parameterTool`
// and the Kafka consumer properties `props` are created earlier in the program)
env.getConfig().setGlobalJobParameters(parameterTool);
// we set the time characteristic to include a processing time window
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);

//Zone2~Food Court
DataStream<Zone2_Sensors> zone2 = env.addSource(new FlinkKafkaConsumer<>(
    "2", ConfluentRegistryAvroDeserializationSchema.forSpecific(
        Zone2_Sensors.class, "http://schema-registry:8081"), props));

// map each Zone2 record to (sensorUID, timestamp, 1); the trailing 1 acts as a counter
@Override
public Tuple3<String, Long, Integer> map(Zone2_Sensors value) throws Exception {
    return new Tuple3<String, Long, Integer>(value.getSensorUID(), value.getTimestamp(), 1);
}});

// sum the counters produced by the mapped stream
@Override
public Tuple3<String, Long, Integer> reduce(Tuple3<String, Long, Integer> value1,
        Tuple3<String, Long, Integer> value2) throws Exception {
    return new Tuple3<String, Long, Integer>(value1.f0, value1.f1, value1.f2 + value2.f2);
}});
//Zone3~Home Decoration
DataStream<Zone3_Sensors> zone3 = env.addSource(new FlinkKafkaConsumer<>(
    "3", ConfluentRegistryAvroDeserializationSchema.forSpecific(
        Zone3_Sensors.class, "http://schema-registry:8081"), props));

@Override
public Tuple3<String, Long, Integer> map(Zone3_Sensors value) throws Exception {
    return new Tuple3<String, Long, Integer>(value.getSensorUID(), value.getTimestamp(), 1);
}});

@Override
public Tuple3<String, Long, Integer> reduce(Tuple3<String, Long, Integer> value1,
        Tuple3<String, Long, Integer> value2) throws Exception {
    return new Tuple3<String, Long, Integer>(value1.f0, value1.f1, value1.f2 + value2.f2);
}});
// join the Zone1 and Zone2 counts on their timestamp field
DataStream<Tuple3<Long,Integer,Integer>> firstStream =
resultzone1.join(resultzone2).where(
new KeySelector<Tuple3<String,Long,Integer>, Long>() {
@Override
public Long getKey(Tuple3<String,Long,Integer> value) throws Exception {
return value.f1;
}}
).equalTo(
new KeySelector<Tuple3<String,Long,Integer>, Long>() {
@Override
public Long getKey(Tuple3<String,Long,Integer> value) throws Exception {
return value.f1;
}}
).window(TumblingProcessingTimeWindows.of(Time.minutes(20)))
.apply(new JoinFunction<Tuple3<String,Long,Integer>,
Tuple3<String,Long,Integer>, Tuple3<Long,Integer,Integer>>()
{
@Override
public Tuple3<Long, Integer, Integer> join(Tuple3<String, Long, Integer>
zone1, Tuple3<String, Long, Integer> zone2) throws Exception {
return new Tuple3<Long, Integer, Integer>(zone1.f1,zone1.f2,zone2.f2);
}});
DataStream<Tuple4<Long,Integer,Integer,Integer>> joinedStream =
firstStream.join(resultzone3).where(
new KeySelector<Tuple3<Long,Integer,Integer>, Long>() {
@Override
public Long getKey(Tuple3<Long,Integer,Integer> value) throws Exception {
return value.f0;
}}
).equalTo(
new KeySelector<Tuple3<String,Long,Integer>, Long>() {
@Override
public Long getKey(Tuple3<String,Long,Integer> value) throws Exception {
return value.f1;
}}
).window(TumblingProcessingTimeWindows.of(Time.minutes(20)))
.apply(new JoinFunction<Tuple3<Long,Integer,Integer>,
Tuple3<String,Long,Integer>, Tuple4<Long,Integer,Integer,Integer>>()
{
@Override
public Tuple4<Long,Integer,Integer,Integer> join(Tuple3<Long,Integer,Integer>
zone1_2, Tuple3<String, Long, Integer> zone3) throws Exception {
return new
Tuple4<Long,Integer,Integer,Integer>(zone1_2.f0,zone1_2.f1,zone1_2.f2,zone3.f2
);
}});
ElasticsearchSink.Builder<Tuple4<Long,Integer,Integer,Integer>> esSinkBuilder
    = new ElasticsearchSink.Builder<>(httpHosts, new ElasticsearchSinkFunction
        <Tuple4<Long,Integer,Integer,Integer>>() {

    // build the JSON document for one result tuple
    // (the field names below are assumptions; the original listing does not show them)
    public IndexRequest createIndexRequest(Tuple4<Long,Integer,Integer,Integer> element) {
        Map<String, Object> json = new HashMap<>();
        json.put("Timestamp", element.f0);
        json.put("Zone1_Count", element.f1);
        json.put("Zone2_Count", element.f2);
        json.put("Zone3_Count", element.f3);
        return Requests.indexRequest()
                .index("flink-streams")
                .type("flink-stream")
                .source(json);
    }

    @Override
    public void process(Tuple4<Long,Integer,Integer,Integer> element,
            RuntimeContext ctx, RequestIndexer indexer) {
        indexer.add(createIndexRequest(element));
    }}
);
esSinkBuilder.setBulkFlushMaxActions(1);
joinedStream.addSink(esSinkBuilder.build());
env.execute();
}
}
1.14.1.2 In-Mall Proximity Traffic
While Mall Foot Traffic quantifies the number of visitors during a regular period of time, In-Mall Proximity Traffic sums the number of visitors per zone; that is to say, it will give the owners an idea of the traffic flow around the mall. The steps to develop the corresponding Flink code are the following:
– For this query, instead of using a ReduceFunction, we will opt for a ProcessWindowFunction that iterates over the sensor stream and maintains the count of elements in a TumblingProcessingTimeWindows.of(Time.minutes(30)) window.
– Inside this process function, we will filter the records as they enter, using the longitude and latitude coordinates, in order to get the counts per area inside a specific zone; here we take the example of three areas that may exist inside the fashion zone (Zone1).
– As a result, we will have a Tuple2<String, Integer>(area, count) containing the name of each area along with the number of visitors at the same time.
The Flink code is described below.
public class FlinkQuery2 {
public static void main(String[] args) throws Exception {
    // (excerpt: environment, Kafka source and stream definitions as in the previous query)
    env.getConfig().setGlobalJobParameters(parameterTool);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    // use the sensor timestamp as the event time of each record
    @Override
    public long extractAscendingTimestamp(Zone1_Sensors element) {
        return element.getTimestamp();
    }
    }));

    // count the visitors per area of Zone1 inside each window
    @Override
    public void process(Tuple tuple, Context context, Iterable<Zone1_Sensors> input,
            Collector<Tuple2<String,Integer>> out) throws Exception {
        int count = 0;
        String area = new String();
        for (Zone1_Sensors in : input) {
            // areas are told apart by their longitude range
            if (in.getLongitude() <= -3.705381 && in.getLongitude() > -3.705689) {
                count++;
                area = "Zone1-Area2-H&M";
            }}
1.14.1.3 Location Marketing
Now that we have the number of visitors per zone, instead of just using it to identify traffic patterns inside the mall, we can also use it to prevent traffic jams inside the shopping center; to put it another way, we will control the visitor traffic by sending personalized promotions to individuals based on their location.
To do so, we will use the previous proximitytraffic stream as follows:
– The goal here is to control the visitor traffic. We start by iterating over the previous proximitytraffic stream, and whenever we find a number of visitors that exceeds 30 people in some area, we send a personalized promotion as a notification in the customer's app.
– Normally, the results of the trafficjam stream should be pushed to a Kafka topic so that they can be pulled by the user's app to generate recommendations and advertisements, but since we do not have a mall application, we instead used Textlocal's messaging API to send an SMS with the ad "Up to 70% Off on Kids Wear at H&M".
– In the end, we will also push the results of these two queries to Elasticsearch, as the proximity-traffic and traffic-jam indexes.
The Flink code is below:
// Send an advertisement to the customers with loyalty programs or that have the Mall application installed
@Override
public void process(Tuple tuple, Context context, Iterable<Tuple2<String, Integer>> input,
        Collector<Tuple3<Integer, String, String>> out) throws Exception {
    // (excerpt: a Tuple3 accumulator `traffic` and a loop over `input` binding each element
    //  to `in` precede this point)
    if (in.f1 >= 25) {
        traffic.f0 = in.f1;
        traffic.f1 = in.f0;
        // Construct data
        String apiKey = "apikey=" + "****************";
        String message = "&message=" + "Up to 70% Off on Kids Wear at H&M";
        String sender = "&sender=" + "Mall Discounts";
        String numbers = "&numbers=" + "********";
        // Send data (the request string `data` concatenates the parameters above;
        // `stringBuffer` collects the API response)
        HttpURLConnection conn = (HttpURLConnection) new
                URL("https://api.txtlocal.com/send/?").openConnection();
        conn.getOutputStream().write(data.getBytes("UTF-8"));
        final BufferedReader rd = new BufferedReader(new
                InputStreamReader(conn.getInputStream()));
        traffic.f2 = stringBuffer.toString();
    }
    else {
        traffic.f0 = in.f1;
        traffic.f1 = in.f0;
        traffic.f2 = "No sign of Traffic Jam at this moment";
    }}
    out.collect(new Tuple3<Integer, String, String>(traffic.f0, traffic.f1, traffic.f2));
}
});
// Elasticsearch sink configuration (one sink per result stream)
    return Requests.indexRequest()
            .index("proximity-traffic")
            .type("proximity-traffic")
            .source(json);
}
@Override
public void process(Tuple2<String,Integer> element, RuntimeContext ctx,
        RequestIndexer indexer) {
    indexer.add(createIndexRequest(element));
}}
);
esSinkBuilder.setBulkFlushMaxActions(1);
proximityTraffic.addSink(esSinkBuilder.build());

    return Requests.indexRequest()
            .index("traffic-jam")
            .type("traffic-jam")
            .source(json);
}
@Override
public void process(Tuple3<Integer,String,String> element, RuntimeContext ctx,
        RequestIndexer indexer) {
    indexer.add(createIndexRequest(element));
}});
1.14.2 Study Case 2: In-Store Optimization
1.14.2.1 Define Under-performing Categories
In this section, as for the next queries, we will rather focus on optimizing the in-store activities. As a first step, we will develop a query that helps us detect the under-performing categories (Accessories, Shoes & Bags, Women, Kids, Men and Home Care) in a store, and therefore find a way to boost their sales without impacting the profits.
To do so, we will make use of the POS streams:
– For this use case, we read the purchase records from the POS1 and POS2 topics, after which we keyBy("Product_Category"), so that we can maintain the count of items per key.
– Then, we count the number of products sold inside a window of 15 minutes; as a result, the ProcessWindowFunction returns a tuple containing the category of the product, the date of purchase and an updated counter of how many items from that category were sold so far.
– These results will then be sent to the product-categories Elasticsearch index.
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);

// count the items sold per product category inside each window
@Override
public void process(Tuple tuple, Context context, Iterable<PurchaseData> input,
        Collector<Tuple3<String, String, Integer>> out) throws Exception {
    int count = 0;
    String category = new String();
    String date = new String();

ElasticsearchSink.Builder<Tuple3<String,String,Integer>> esSinkBuilder1 = new
        ElasticsearchSink.Builder<>(httpHosts, new ElasticsearchSinkFunction
        <Tuple3<String,String,Integer>>() {
    return Requests.indexRequest()
            .index("product-categories")
            .type("product-categories")
            .source(json);
}
@Override
public void process(Tuple3<String,String,Integer> element,
        RuntimeContext ctx, RequestIndexer indexer) {
    indexer.add(createIndexRequest(element));
}}
);
esSinkBuilder1.setBulkFlushMaxActions(1);
categoryCount.addSink(esSinkBuilder1.build());
1.14.2.2 Customer Payment Preference
The goal of this query, as with every in-store optimization method, is to improve the customer experience by detecting the preferences of the visitors and accordingly reviewing the flexibility of the POS system, with the aim of creating new opportunities to deepen the relationship with the customers.
Here we have the same flow of functions used in the previous query, only with a different output, Tuple2<String, Integer>(payment, count), which will be pushed to a different index, store-payment.
The implementation is as follows:
@Override
public void process(Tuple tuple, Context context, Iterable<PurchaseData> input,
        Collector<Tuple2<String, Integer>> out) throws Exception {
    int count = 0;
    String payment = new String();
    for (PurchaseData in : input) {
        count++;
        payment = in.getPaymentMethod();
    }
    out.collect(new Tuple2<String,Integer>(payment, count));
}
});

    return Requests.indexRequest()
            .index("store-payment")
            .type("store-payment")
            .source(json);
}
@Override
public void process(Tuple2<String,Integer> element, RuntimeContext ctx,
        RequestIndexer indexer) {
    indexer.add(createIndexRequest(element));
}}
);
esSinkBuilder2.setBulkFlushMaxActions(1);
paymentMethod.addSink(esSinkBuilder2.build());
1.14.2.3 Inventory Checking
This query keeps track of the number of items left in stock for each product, so that the analytics team can be warned before a product runs out of stock.
The program is defined below:
// Inventory Checking
DataStream<Tuple3<Long,Integer,String>> stockAvailability = purchaseStream
    .keyBy("productid")
    .timeWindow(Time.minutes(15))
    .process(new org.apache.flink.streaming.api.functions.windowing
            .ProcessWindowFunction<PurchaseData, Tuple3<Long, Integer, String>,
            Tuple, TimeWindow>() {
    @Override
    public void process(Tuple tuple, Context context, Iterable<PurchaseData> elements,
            Collector<Tuple3<Long, Integer, String>> out) throws Exception {

    return Requests.indexRequest()
            .index("stock-availability")
            .type("stock-availab")
            .source(json);
}
@Override
public void process(Tuple3<Long,Integer,String> element,
        RuntimeContext ctx, RequestIndexer indexer) {
    indexer.add(createIndexRequest(element));
}});

esSinkBuilder3.setBulkFlushMaxActions(1);
stockAvailability.addSink(esSinkBuilder3.build());
1.14.2.4 Compare sales potential
Finally, we will construct a metric that measures the store's daily sales, with the aim of creating a sales report for each month. In this query, the transformation is simpler: it consists of getting, from the POS stream, the amount paid by shoppers in each purchase along with the date, and then storing the aggregated results in the Elasticsearch store-sales index.
The Flink code is the following:
@Override
public void process(Tuple tuple, Context context, Iterable<PurchaseData> elements,
        Collector<Tuple2<String, Integer>> out) throws Exception {
    int totalSales = 0;
    String date = new String();

    return Requests.indexRequest()
            .index("store-sales")
            .type("store-sale")
            .source(json);
}
@Override
public void process(Tuple2<String,Integer> element, RuntimeContext ctx,
        RequestIndexer indexer) {
    indexer.add(createIndexRequest(element));
}});

esSinkBuilder4.setBulkFlushMaxActions(1);
totalSales.addSink(esSinkBuilder4.build());
env.execute();
These metrics will help the owners quantify the customer's journey inside the mall, determine a particular customer's purchase pattern, and therefore increase the sales. Nevertheless, there is a practically unlimited number of metrics that could be deployed and others that could be improved; to choose among them, the problem at hand needs to be clearly defined from the beginning in order to determine the adequate metrics that can solve it.
1.15 Data Visualization and Distribution
Now that we have the results of the queries stored in Elasticsearch indexes, the analytics team can use this information to monitor the retail metrics. In this project, we used Kibana's visualization features to demonstrate the data exploration.
This section illustrates the visualizations created for the dashboard from the Elasticsearch indices. The first visualization created was a stacked line chart for monitoring the Mall Foot Traffic inside the shopping center:
This chart will give the owners an idea of how the flow of visitors changes across the different zones of the mall over time, and it will help them detect in which periods visitors tend to shop. For example, at timestamp 1583026800, which corresponds to 1/3/2020 at 1:40:00, we can clearly see a spike that may be explained by a discount period or some kind of event inside the mall.
Another chart for in-mall analytics would be a heat map that will be updated
with the number of visitors present in each area:
Figure 21 : Stacked Pie chart for Traffic Control
This visualization gives a clear view of the number of visitors present in each area, along with the possibility of a traffic jam forming in front of a store. If the number of people exceeds 30, the mall app will trigger a push notification to the visitors in that area (the ones having the app installed on their phones) with recommendations and discounts available in other stores; this way we can control the traffic flow and prevent theft attempts.
Now, for the in-store metrics, we will start by creating a horizontal bar chart to represent the number of items sold by each store category, as follows:
For this period of time, we clearly see that there is only a small difference in the number of items sold for each category. This would help the retailers think of new strategies to boost the sales in the underperforming categories, such as: buy one item from category X and get 50% off an item in category Y.
Figure 24 : Pie chart for stock control
The main goal behind this metric is to prevent lost sales by monitoring the stock and having up-to-date information about the store's health. We can use this metric not only to help plan new inventory purchases, but also to estimate product demand and detect the best-selling products. In our case, it will trigger a notification to inform the analytics team whenever inventory levels are out of sync with demand, giving retailers enough time to increase their orders before a product goes viral (in the query we set a threshold of 30 items per product).
And last but not least, we will create an area chart to quantify sales growth:
This metric will be used to build sales reports; they will include the total sales revenue, the profits and an analysis of sales growth. This will show how efficiently the company is making money, how the business profitability is changing, and whether the current business strategy is effective or needs restructuring.
We can see in the chart that, within the same day, the total sales may increase or decrease; that is because we took into consideration the items that customers might take and then return for some reason, which results in a sales decrease.
To summarize this part of the thesis, whenever a company wants to perform any kind of visualization, it must first decide the goal or the value behind it; only then can the generated insights help create more effective retail optimization solutions.
Chapter 5 Conclusion
The goal of this project was to develop a stream processing pipeline able to examine individual shopping behavior, with the aim of helping mall managers and store owners react quickly to discovered insights while they still count. Our architecture brings together stream processing capabilities, real-time analytics and visual dashboards in one solution.
Chapter 3 focused on exploring the main components used to build our system architecture and how they interact with each other. For generating the sensor and point-of-sale data, Kafka Connect Datagen was used along with the Confluent Schema Registry to produce Avro messages; those records are then sent to Kafka topics so they can be pulled by the stream processing framework, Flink, to be combined and analyzed. The results of the Flink transformations are then sent to Elasticsearch, a search engine that stores them in indexes so that Kibana, the visualization and exploration tool, can easily query and analyze them to deliver quick insights.
Chapter 4 then developed two study cases. The first is Mall Analytics, where the Mall Foot Traffic metric helps the managers track the visitor patterns inside the mall and therefore gain better visibility of traffic jams and of the most used routes inside it. The second is In-Store Optimization: in this part we developed Flink programs that can help the store owners investigate the shopper experience with the aim of addressing their needs and expectations. In detail, the first program helps identify the underperforming product categories inside the store (accessories, kids, women, and so on); this way the owners will be aware of which areas need to be optimized and which products are not selling. The second one helps with monitoring the inventory levels, to make sure products are never out of stock, especially in today's market where customers have many choices in front of them, most of them just a click away. The third program tries to detect the customer payment preferences, so that retailers can adapt offers to them and opt for up-to-date point-of-sale systems; this is part of defining a customer's profile to provide a more personalized experience. The goal of the last metric is to keep a history of the store's daily sales in order to construct sales reports and have visibility of the effectiveness of the current business strategies. All these results are stored in Elasticsearch indexes and updated after each computation; Kibana then connects to those indexes and uses the stored data to generate visualization reports, dynamic dashboards and alerts.
All things considered, this project could be expanded by adding another storage layer to our system architecture, such as HBase or Cassandra, where results can be stored for deeper analysis, and also by implementing machine learning programs that can predict trends, customer behavior and inventory needs (determining when it is time to reorder, and knowing when a shortage might occur), and therefore make better decisions.
3 Annexes
Annex 1
Docker-compose.yaml
---
version: '2'
services:
  zookeeper:
    image: wurstmeister/zookeeper:latest
    container_name: zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_TICK_TIME: 2000
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
  kafka:
    image: wurstmeister/kafka:latest
    container_name: kafka
    ports:
      - "9092:9092"
      - "9093:9093"
    depends_on:
      - zookeeper
    expose:
      - "9093"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENERS: INSIDE://0.0.0.0:9092,OUTSIDE://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:9092,OUTSIDE://localhost:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"
      KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  schema-registry:
    image: confluentinc/cp-schema-registry:5.0.0
    container_name: schema-registry
    depends_on:
      - zookeeper
      - kafka
    ports:
      - '8081:8081'
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: PLAINTEXT://kafka:9092
  connect-1:
    image: confluentinc/kafka-connect-datagen:0.3.0
    build:
      context: .
      dockerfile: Dockerfile-confluenthub
    container_name: connect-1
    restart: always
    ports:
      - "8083:8083"
    depends_on:
      - zookeeper
      - kafka
      - schema-registry
    environment:
      CONNECT_BOOTSTRAP_SERVERS: kafka:9092
      CONNECT_REST_PORT: 8083
      CONNECT_HOST_NAME: connect-1
      CONNECT_GROUP_ID: kafka-connect-sensordata
      CONNECT_REST_ADVERTISED_HOST_NAME: connect-1
      CONNECT_CONFIG_STORAGE_TOPIC: docker-connect-configs
      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_OFFSET_FLUSH_INTERVAL_MS: 10000
      CONNECT_OFFSET_STORAGE_TOPIC: docker-connect-offsets
      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_STATUS_STORAGE_TOPIC: docker-connect-status
      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_KEY_CONVERTER: io.confluent.connect.avro.AvroConverter
      CONNECT_VALUE_CONVERTER: io.confluent.connect.avro.AvroConverter
      CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: 'http://schema-registry:8081'
      CONNECT_KEY_CONVERTER_SCHEMA_REGISTRY_URL: 'http://schema-registry:8081'
      CONNECT_INTERNAL_KEY_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_INTERNAL_VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_PRODUCER_INTERCEPTOR_CLASSES: "io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor"
      CONNECT_CONSUMER_INTERCEPTOR_CLASSES: "io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor"
      CONNECT_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      CONNECT_PLUGIN_PATH: "/usr/share/java,/usr/share/confluent-hub-components"
      CONNECT_LOG4J_LOGGERS: "org.apache.kafka.connect.runtime.rest=WARN,org.reflections=ERROR"
  kafka-topics-ui:
    image: landoop/kafka-topics-ui:latest
    depends_on:
      - zookeeper
      - kafka
      - rest-proxy
    ports:
      - "8000:8000"
    environment:
      KAFKA_REST_PROXY_URL: 'rest-proxy:8082'
      PROXY: "true"
  rest-proxy:
    image: confluentinc/cp-kafka-rest:5.3.1
    container_name: rest-proxy
    depends_on:
      - zookeeper
      - kafka
      - schema-registry
    ports:
      - "8082:8082"
    environment:
      KAFKA_REST_HOST_NAME: rest-proxy
      KAFKA_REST_BOOTSTRAP_SERVERS: PLAINTEXT://kafka:9092
      KAFKA_REST_SCHEMA_REGISTRY_URL: 'http://schema-registry:8081'
  kafka-connect-ui:
    image: landoop/kafka-connect-ui:latest
    depends_on:
      - zookeeper
      - kafka
      - connect-1
    ports:
      - "8001:8000"
    environment:
      CONNECT_URL: 'connect-1:8083'
  jobmanager:
    image: flink:latest
    expose:
      - "6123"
    ports:
      - "8089:8089"
    command: jobmanager
    links:
      - zookeeper
      - kafka
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
  taskmanager:
    image: flink:latest
    expose:
      - "6121"
      - "6122"
    depends_on:
      - jobmanager
    command: taskmanager
    links:
      - "jobmanager:jobmanager"
      - kafka
      - zookeeper
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
  kibana:
    image: docker.elastic.co/kibana/kibana:7.7.1
    restart: always
    ports:
      - "5601:5601"
    environment:
      ELASTICSEARCH_URL: http://172.18.0.9:9200
      ELASTICSEARCH_HOSTS: http://172.18.0.9:9200
    depends_on:
      - elasticsearch
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.7.1
    environment:
      - xpack.security.enabled=false
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - cluster.routing.allocation.disk.threshold_enabled=false
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - "9200:9200"
    expose:
      - "9200"
Annex 2
{ "namespace": "com.SensorData",
"name": "Zone1_Sensors",
"type": "record",
"fields": [
{"name":"sensor_UID", "type": {
"type":"string",
"arg.properties":{
"options": [ "e74e8480-42da-46c1-b197-584e6030120c",
"e4b9e4b4-4713-419d-80cf-17a21c48d5da",
"90330f07-b71b-4e60-9b16-cd11479f78f7",
"d0b5a283-9bb9-46f6-9a55-7f13fd280481",
"88b63897-b90e-44cb-a455-7b67ff051dc0",
"7689b71e-9899-4e70-a1c7-a019aef39f9e",
"dbea6409-289e-4bba-8866-78a967980350",
"84e05ffe-4f5f-4e72-a5ed-857bb7a90002",
"542a5ea7-6312-4142-9682-b128c8ef444d",
"140faa0c-4be8-4a96-8e51-aca4fe0541f4"
]
}
}},
{"name": "MAC", "type": {
    "type": "string",
    "arg.properties": {
        "options": [ "CE-F2-94-BD-00-B7", "97-AF-38-A3-E9-C7", "7E-4C-CF-8C-76-B6",
            "37-1C-0D-74-5F-59", "AE-1B-41-96-E4-E6", "8A-0D-32-D1-69-4E", "85-36-04-91-79-90",
            "15-08-B3-41-77-30", "99-E9-FD-05-84-DB", "C3-F4-4C-30-15-D8", "46-22-18-F1-0F-FC",
            "8E-F7-57-A5-83-BE", "D7-D0-57-41-A8-D3", "97-28-40-58-BB-24", "AF-17-37-25-07-66",
            "0E-A7-86-16-8A-4C", "E0-8A-AA-4C-8F-DD", "47-D9-45-1B-91-BD", "10-53-2F-17-95-67",
            "C2-2E-40-46-E1-4D" ]
    }
}},
{"name": "ip", "type": {
"session": "true",
"type":"string",
"arg.properties": {
"options":["111.152.45.45",
"111.203.236.146",
"111.168.57.122",
"111.249.79.93",
"111.168.57.122",
"111.90.225.227",
"111.173.165.103",
"111.145.8.144",
"111.245.174.248",
"111.245.174.111",
"222.152.45.45",
"222.203.236.146",
"222.168.57.122",
"222.249.79.93",
"222.168.57.122",
"222.90.225.227",
"222.173.165.103",
"222.145.8.144",
"222.245.174.248",
"222.245.174.222",
"122.152.45.245",
"122.203.236.246",
"122.168.57.222",
"122.249.79.233",
"122.168.57.222",
"122.90.225.227",
"122.173.165.203",
"122.145.8.244",
"122.245.174.248",
"122.245.174.122",
"233.152.245.45",
"233.203.236.146",
"233.168.257.122",
"233.249.279.93",
"233.168.257.122",
"233.90.225.227",
"233.173.215.103",
"233.145.28.144",
"233.245.174.248",
"233.245.174.233"
]
}
}
},
{"name": "latitude", "type": {
"type": "double",
"arg.properties": {
"range": {
"min": 40.419373,
"max": 40.419999
}
}
}},
{"name": "longitude", "type": {
"type":"double",
"arg.properties":{
"range":{
"min": -3.705999,
"max": -3.705075
}
}
}},
{
"name": "Timestamp","type": {
"type": "long",
"arg.properties":{
"iteration":{
"start" : 1583024400,
"restart": 1590973200,
"step": 300
}}
}
}
]
}
PurchaseData.avsc:
{ "namespace": "com.SensorData",
"name": "PurchaseData",
"type": "record",
"fields": [
{
"name": "Seller_Name","type": {
"type": "string",
"arg.properties":{
"options": [ "Seller_1", "Seller_2", "Seller_3", "Seller_4"
]}
}},
{"name": "productid",
"type": {
"type": "long",
"arg.properties": {
"iteration":{
"start" : 1,
"restart": 30
}}
}
},
{
"name": "Product_Category","type": {
"type": "string",
"arg.properties":{
"options": [ "Accessories", "Shoes & Bags", "Women", "Kids",
"Men", "Home Care"
]}
}},
{
"name": "Number_Available_stock",
"type": {
"type": "int",
"arg.properties": {
"range":{
"min": 0,
"max": 100
}
}
}},
{
"name": "Payment_Method","type": {
"type": "string",
"arg.properties":{
"options": [ "Cash", "Mobile Payment", "Credit Card"
]}
}},
{"name": "Amount",
"type": {
"type":"int",
"arg.properties":{
"range":{
"min": 3,
"max": 200
}
}
}},
{"name": "Loyalty_Card","type": {
"type": "string",
"arg.properties":{
"options": [ "true" , "false" ]
}}},
{
"name": "Timestamp","type": {
"type": "long",
"arg.properties":{
"iteration":{
"start" : 1583024400,
"restart": 1588294861,
"step": 86400
}}
}
}
]
}