TFM Widad El Abbassi
Universidad Politécnica de Madrid
Escuela Técnica Superior de
Ingenieros Informáticos
July 2020
Supervisor:
Marta Patiño-Martínez
LENGUAJES Y SISTEMAS INFORMÁTICOS E INGENIERÍA DE SOFTWARE
ETSI Informáticos
Universidad Politécnica de Madrid
Abstract
Stream processing technologies are becoming increasingly popular in the retail industry, both in physical stores and in e-commerce. Retailers and mall managers compete intensely to provide the solutions that best meet customer expectations. This can be achieved by studying customer shopping behavior inside shopping centers and stores, which provides information about the pattern of shopping activities and the movements of consumers: we can track the most commonly used routes inside the mall, identify when occupancy peaks, and determine which brands or products attract shoppers more than others (the shopper profile). The gathered data will then allow us to make more effective decisions about the store layout, product positioning, marketing, traffic control and more.
The purpose of this project is to examine consumer behavior inside shopping centers and stores. In particular, we want to generate insights for two types of analytics: Mall Analytics, by measuring foot traffic, in-mall proximity traffic and location marketing, with the intention of helping mall managers improve security and advertising; and In-Store Analytics, to help retailers identify underperforming product categories, compare sales potential and improve inventory management. However, the real challenge does not only lie in storing and managing this huge amount of data but also in accessing the results and providing reports in real time. With this in mind, our work proposes a real-time data processing architecture able to ingest, analyze and generate visualization reports almost immediately. In detail, the first component in the proposed pipeline is the Kafka Connect framework, which will be responsible for generating continuous flows of sensor and POS (point of sale) data. These flows will then be sent to the second component, Apache Kafka, a distributed messaging system that will store the incoming messages in multiple Kafka topics (for instance, sensor1 in zone1, area1 of the mall will be stored in a particular topic1). The third component in this architecture will be the processing unit, Apache Flink, a streaming dataflow engine and scalable data analytics framework that delivers analytics in real time; one of its most interesting features is the use of event timestamps to build time windows for computations. Several Flink queries will be developed to measure the pre-defined metrics (Mall Foot Traffic, Location Marketing, and so on). The fourth component will be a real-time search and analytics engine, Elasticsearch, in which the results of the previous queries will be stored in indexes and then used by the final component, Kibana, a powerful visualization tool that delivers insights and dynamic visualization reports.
Our work consists of implementing this streaming analytics pipeline, which will help mall managers and retailers investigate the whole shopping process and thus design more effective development plans and marketing strategies.
Contents
Chapter 1 Introduction
1.1 Motivation
1.2 Goal
1.3 Thesis organization
Chapter 2 Background
1.4 Big Data
1.5 The importance of stream processing
1.6 Stream Processing frameworks
1.6.1 Apache Spark
1.6.2 Apache Storm
1.6.3 Apache Flink
1.6.4 Apache Samza
1.6.5 From Lambda to Kappa Architecture
1.6.5.1 Lambda Architecture
1.6.5.2 Kappa Architecture
Chapter 3 System Architecture
1.7 Kafka
1.7.1 Kafka Connect
1.7.1.1 Apache Avro
1.7.1.2 Schema Registry
1.8 Flink
1.9 Elasticsearch
1.10 Kibana
1.11 Data processing architecture
Chapter 4 Implementation
1.12 Environment Setup
1.13 Data Generation
1.13.1 Mall Sensor Data
1.13.2 Purchase Data
1.14 Data Processing
1.14.1 Study Case 1: Mall Analytics
1.14.1.1 Mall Foot Traffic
1.14.1.2 In-Mall Proximity Traffic
1.14.1.3 Location Marketing
1.14.2 Study Case 2: In-Store Optimization
1.14.2.1 Define Under-performing Categories
1.14.2.2 Customer Payment Preference
1.14.2.3 Inventory Checking
1.14.2.4 Compare sales potential
1.15 Data Visualization and Distribution
Chapter 5 Conclusion
3 Annexes
Annex 1
Annex 2
List of figures
Figure 1: 5Vs of Big Data
Figure 8: Confluent Schema Registry for storing and retrieving schemas
Chapter 1 Introduction
1.1 Motivation
The retail market has changed dramatically in the last few years: in addition to the physical retailers that struggle every day, online retailers have entered the scene, together with new brands trying to earn their place in the market. Taken together, this has pushed retailers to adopt new store format strategies in order to meet customer expectations and industry competition. They have therefore started using new technologies such as BLE beacons, Wi-Fi tracking, and POS (point of sale) systems to track shopper behavior through location, time and activity. This continuous flow of sensor readings creates a wealth of data that can be processed by a streaming framework such as Flink in order to get quick insights, making it possible to quantify the customer's journey and purchase pattern. In the end, retailers will be able to detect, identify, and track where people are (location), how long they stay (time), and what they do (activities), which will help them understand the store's sales opportunity, optimize their product positioning, and adopt location marketing to notify their customers about the offers and discounts that match their interests.
1.2 Goal
The goal of this project is to design and implement a retail analytics solution that helps retailers get more value from their existing resources, improve their in-store and mall analytics metrics, and therefore increase the business benefit.
We start by determining the set of problems we aim to solve; for instance, we want to track the visitor patterns inside the mall to control the traffic. We then need to think about the type of data that will allow us to measure this metric, which fields we need to define for generating the data, and how we must produce the records. To do this, we have chosen the Kafka Connect framework as the first component of our architecture; its Datagen connector will produce data according to the Avro schemas we will define and store in the Confluent Schema Registry. As a result we will have two continuous streams of data, one representing the mall sensors and the other reflecting the data sent by the point-of-sale systems.
Now that we have constructed the data, we will choose which stream processing engine to deploy. Many frameworks could be used to build the queries for this type of scenario; our choice was Apache Flink, because its event timestamps help set time windows for computations, along with its other features. Flink programs will be developed within two scenarios: first, we will program a set of metrics for Mall Analytics, which are respectively Mall Foot Traffic, In-Mall Proximity Traffic and Location Marketing. Secondly, we will program key performance indicators for In-Store Optimization, which will help define the underperforming products inside the store, detect customer preferences, construct sales reports and, finally, optimize the inventory management.
Next, we will forward the query results to Elasticsearch indexes, where the data can be analyzed and explored by Kibana, the visualization tool intended to create dynamic dashboards and deliver quick insights about the retail operations, consumer demand and sales. To give an illustration of this, consider the case where we want to quantify the ability of a store to convert demand into revenue, also known as the sales conversion. To do so, we will first need to gather the data from the POS as well as from the store sensors; secondly, we will create a Flink query that calculates the ratio of transactions (buyers) to visitors, over a window of one day for example. Each day the result will be stored in Elasticsearch, and each month the dashboard will be updated in Kibana with the changes and transitions seen in the sales conversion. This way we will be able to deliver business conclusions and propose optimizations for the current strategies.
1.3 Thesis organization
Chapter 4 represents the core of this thesis; it is where we introduce the different steps followed to set up the work environment and design the retail application, from generating the data to visualizing the results.
In Chapter 5, we summarize the thesis and point out future work that can extend this project.
Chapter 2 Background
To provide a better understanding of this project, this chapter highlights the
basic concepts of big data technologies, and more specifically the importance of
streaming analytics in today’s world.
1.4 Big Data
There are several definitions that compete to clarify the term "Big Data"; here we have chosen one of the most credible:
"Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions." (Oxford Dictionary)
Nevertheless, most definitions describe Big Data based on the 5V principle, which captures the various challenges encountered when designing algorithms or software systems capable of processing this data.
- Volume: Due to the data explosion caused by digital and social media, data is rapidly being produced in very large chunks, and the conventional methods of business intelligence and analytics cannot keep up with storing and processing it. Big data volume defines the amount of data that is generated; the value of data is also dependent on its size.
- Variety: Data sources in big data may involve external sources as well as internal business units. Generally, big data is classified as structured, semi-structured and unstructured data. In fact, almost 80 percent of the data produced globally, including photos, videos, mobile data and social media content, is unstructured in nature.
- Veracity: Veracity in Big Data is related to Big Data security and revolves around two aspects: data consistency (or certainty), which can be defined by its statistical reliability; and data trustworthiness, which is defined by a number of factors including the data origin, the collection and processing methods, and the use of a trusted infrastructure and facility. Big Data veracity ensures that the data used is trusted, authentic and protected from unauthorized access and modification.
- Value: In the context of big data, value amounts to how worthy the data is of positively impacting a company's business. What you do with the collected data is what matters. With the help of advanced data analytics, useful insights can be derived from the collected data; these insights, in turn, are what add value to the decision-making process.
1.5 The importance of stream processing
Stream processing is about making the right data available at the right time; this data might be coming as sensor events, user activity on a website, financial trades, and so on.
The need for stream processing technologies comes from many reasons:
- Batch processing is not the optimal solution when it comes to dealing with a never-ending stream of events: it requires stopping the data collection each time we want to store and process it; afterwards, we have to handle the next batch and then worry about aggregating across multiple batches. Stream processing, on the other hand, handles never-ending data streams naturally and easily. It can detect patterns, inspect results, look at multiple levels of focus, and also look at data from multiple streams simultaneously.
- Stream processing handles data streams as they come in, hence it spreads the processing over time instead of letting the data build up and then processing it all at once, as batch processing does.
1.6 Stream Processing frameworks
1.6.1 Apache Spark
Apache Spark is a widely used, highly flexible engine for batch-mode and stream data processing that is well developed for scalable performance at high volumes. To maximize the performance of Big Data analytics applications, Spark's in-memory data processing engine conducts analytics, ETL, machine learning and graph processing on data whether in motion or at rest.
The Apache Spark architecture is founded on Resilient Distributed Datasets (RDDs). These are distributed, immutable collections of data, which are split up and allocated to workers. The worker executors process the data. Since the RDD is immutable, the worker nodes cannot make alterations; they process the information and output results.
A Spark application takes data from a collection of sources (HDFS, NoSQL and relational databases, etc.), applies a set of transformations on them, and then executes an action that generates meaningful results.
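As a minimal illustration of this source, transformation and action model (a sketch only, not part of the thesis pipeline, and assuming a local Spark installation with its Java API), a tiny job could look as follows:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class SparkSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // source: an in-memory collection stands in for HDFS or a database
            JavaRDD<Integer> amounts = sc.parallelize(Arrays.asList(12, 40, 7, 55));
            // transformation: RDDs are immutable, so filter() returns a new RDD
            JavaRDD<Integer> large = amounts.filter(a -> a > 10);
            // action: triggers execution and brings a result back to the driver
            long count = large.count();
            System.out.println("purchases over 10: " + count);
        }
    }
}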
1.6.2 Apache Storm
Introduced by Twitter, Apache Storm is one of the most popular and widely adopted open-source distributed real-time frameworks. It is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size with very low latency, which makes Storm suitable for near-real-time processing workloads. It processes large quantities of data and provides results with lower latency than most other available solutions.
The Apache Storm architecture is founded on spouts and bolts. Spouts are sources of information and transfer it to one or more bolts; these are in turn linked to other bolts, and the entire topology forms a DAG. Developers define how the spouts and bolts are connected.
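To make that wiring concrete, the following sketch (an illustrative example written against the Storm 2.x Java API, with a hypothetical VisitSpout and PrinterBolt that are not part of this thesis) builds a small topology and runs it on an in-process LocalCluster:

import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class StormSketch {

    // Spout: emits a synthetic "visit" tuple every 100 ms
    public static class VisitSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        @Override
        public void open(Map<String, Object> conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("zone1"));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("zone"));
        }
    }

    // Bolt: consumes the spout's tuples and prints them
    public static class PrinterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println("visit in " + input.getStringByField("zone"));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt, nothing to declare
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("visits", new VisitSpout());
        builder.setBolt("printer", new PrinterBolt()).shuffleGrouping("visits");
        try (LocalCluster cluster = new LocalCluster()) {   // in-process cluster, for testing only
            cluster.submitTopology("sketch", new Config(), builder.createTopology());
            Thread.sleep(5000);
        }
    }
}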
1.6.3 Apache Flink
Apache Flink is a distributed stream processor with expressive APIs for implementing stateful stream processing applications; since it is the engine chosen for this project, it is described in detail in Section 1.8.
1.6.4 Apache Samza
Apache Samza is a distributed stream processing framework whose durable, Kafka-based messaging layer ensures that no message is lost. It is also scalable, as it is partitioned and distributed at all levels.
The main goals of Apache Samza are having better fault tolerance, processor
isolation, security, and resource management.
1.6.5 From Lambda to Kappa Architecture
With the increasing volume of data and the need to analyze and obtain value from the generated data as soon as possible, new architectures have to be defined to cover use cases different from the existing ones. The architectures most commonly used by companies are mainly two: the Lambda Architecture and the Kappa Architecture. The main difference between both will be explained in the following sections.
1.6.5.1 Lambda Architecture
The data stream entering a Lambda system is dual-fed into both a batch and a speed layer, as shown in figure 6.
Figure 6 : Lambda Architecture
1.6.5.2 Kappa Architecture
The Kappa Architecture treats everything as a stream and processes the data as it arrives; in other words, this architecture can be used in cases where the only thing that really matters is analyzing the incoming data as soon as it comes in. Its main principles are:
- The starting data is not modified: the data is stored without being transformed, and the views are derived from it. A specific state can be recalculated, since the source information is not modified.
- There is only one processing flow: since we maintain a single flow, the code, the maintenance and the system updates are considerably reduced.
Chapter 3 System Architecture
This chapter describes the main components used to build the work environment. It briefly discusses the function and features provided by each of the frameworks, both in general and as used in this project.
The goal behind the developed architecture is to help retailers and mall managers take immediate, process-based action on the discovered insights, so they can plan the right changes to the stores at the right time.
The set of problems we aim to solve with this architecture can be divided into two big sections. The first use case is Mall Analytics, where it is important to keep track of the customers inside the shopping center to identify foot-traffic trends and improve security and advertising. The second use case is In-Store Optimization, where information such as the customer's payment method, the products bought during the day and those that have been returned can help retailers define a customer's profile and therefore come up with more personalized offers. Furthermore, retailers also need solutions to optimize their inventory management, conduct deeper analysis of their sales potential, and thus improve their real-time decisions.
1.7 Kafka
Apache Kafka is an open-source platform whose main function is the centralization of the data flows coming from the various systems inside a company. This platform was born as a distributed messaging system in publish-subscribe mode, capable of supporting very high data rates and of ensuring the persistence of the data it receives. The data received by Apache Kafka is kept within topics, and each topic corresponds to a category of data. The systems that publish data to Kafka topics are producers, while the systems that read data from topics are consumers.
In this thesis, Kafka is used to deliver stream events to Apache Flink without transforming them: it receives the messages from the data sources, stores them within topics, and then serves them to Flink for processing. Furthermore, it can also be used in the opposite direction, receiving the results from Flink and putting them into topics so that they can be pulled by the mall application.
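As a small, hedged illustration of the publish side (not taken from the thesis code; the topic name "1" and the localhost:9093 listener are assumptions based on the docker-compose file in Annex 1), a Java producer could publish a message as follows:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9093");   // the broker's OUTSIDE listener
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // publish one message to the topic that holds Zone1 sensor events
            producer.send(new ProducerRecord<>("1", "sensor-reading"));
        }
    }
}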
1.7.1 Kafka Connect
Kafka Connect is the framework used to stream data into Kafka through ready-made connectors; in this project its Datagen connector generates the test data. The connector's predefined quickstart schemas do not correspond to the type of events we would like to generate, which is why we defined our own schema specification using Apache Avro.
Apache Avro is a data serialization system used here to define the data schema for each record's value. This schema describes the fields allowed in a given value, along with their data types, using a JSON format.
In this project, we used Apache Avro to customize the fields and their values to match the application requirement of having sensor data and purchase data.
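For illustration only, the sketch below builds a record against a cut-down, two-field version of the sensor schema using Avro's generic API; the complete schema with all its fields is the one listed in Annex 2:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroSketch {
    public static void main(String[] args) {
        // a cut-down, two-field version of the sensor schema (the real one is in Annex 2)
        String schemaJson = "{\"type\":\"record\",\"name\":\"SensorReading\",\"fields\":["
                + "{\"name\":\"sensor_UID\",\"type\":\"string\"},"
                + "{\"name\":\"Timestamp\",\"type\":\"long\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // a record must match the fields and types declared by the schema
        GenericRecord reading = new GenericData.Record(schema);
        reading.put("sensor_UID", "e74e8480-42da-46c1-b197-584e6030120c");
        reading.put("Timestamp", 1583024400L);
        System.out.println(reading);
    }
}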
1.7.1.2 Schema Registry
Confluent Schema Registry provides a serving layer for schema metadata. It provides a RESTful interface for storing and retrieving Avro schemas, and it keeps the versions of our schemas together with their compatibility settings, which the Avro serializers rely on.
The Schema Registry lives outside of, and separately from, the Kafka brokers. Producers and consumers still talk to Kafka to publish and read data (messages) to and from topics. Concurrently, they can also talk to the Schema Registry to send and retrieve the schemas that describe the data models of those messages. Likewise, in our architecture Flink will synchronize with the Confluent Schema Registry to retrieve the sensor and POS data schemas (see Figure 8).
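As a hedged example of that RESTful interface (assuming the registry's default subject naming of topic-value and the http://schema-registry:8081 address used throughout this project), the latest schema registered for the value of topic "1" could be fetched in Java like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SchemaRegistrySketch {
    public static void main(String[] args) throws Exception {
        // latest schema registered for the value of topic "1"
        // (the registry's default subject naming strategy is <topic>-value)
        URL url = new URL("http://schema-registry:8081/subjects/1-value/versions/latest");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = rd.readLine()) != null) {
                System.out.println(line);   // JSON with subject, version, id and the Avro schema
            }
        }
    }
}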
1.8 Flink
Apache Flink is a distributed stream processor with intuitive and expressive APIs to implement stateful stream processing applications. It efficiently runs such applications at large scale in a fault-tolerant manner. Flink on its own is the runtime core engine that does the stream processing. On top of the engine, Flink has its abstraction APIs: the DataSet API to consume and process batch data sources, and the DataStream API to consume and process real-time streaming data. These two APIs are programming abstractions and the foundation for user programs.
Flink stands out thanks to many features, for instance: its event-time and processing-time semantics, exactly-once state consistency guarantees, layered APIs with varying tradeoffs between expressiveness and ease of use, and millisecond latencies while processing millions of events per second, in addition to many other features.
In this thesis, Flink is deployed as the stream processing framework of choice for our streaming retail application: it will continuously consume the streams of events provided by Kafka, apply transformations on them and then update the results in the search indexes. Finally, its event timestamps can be used to build time windows for the computations.
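To give a feel for the DataStream API before the real queries of Chapter 4, here is a minimal, self-contained sketch in which an in-memory source stands in for the Kafka topics; nothing in it is thesis code:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // a bounded in-memory stream stands in for the Kafka-backed sensor streams
        DataStream<String> zones = env.fromElements("zone1", "zone2", "zone1");

        // a stateless transformation expressed on the DataStream API
        DataStream<String> visits = zones.map(new MapFunction<String, String>() {
            @Override
            public String map(String zone) {
                return "visit@" + zone;
            }
        });

        visits.print();
        env.execute("flink-sketch");
    }
}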
1.9 Elasticsearch
Figure 11 : Elasticsearch for fast streaming
Elasticsearch is a distributed, real-time search and analytics engine in which data is stored in indexes that can be queried with very low latency. In our pipeline, Apache Flink executes stream analysis jobs on the sensor and POS data, applies transformations to analyze, transform and model the data in motion, and finally writes the results to an Elasticsearch index. Kibana connects to the index and queries it for data to visualize.
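As a small sketch of what writing one such result document looks like outside of Flink (assuming the 7.x high-level REST client that matches the Elasticsearch 7.7.1 image of Annex 1, and reusing the flink-streams index name only as an example):

import java.util.HashMap;
import java.util.Map;

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class EsIndexSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // one result document, as a Flink query might produce it
            Map<String, Object> doc = new HashMap<>();
            doc.put("zone", "zone1");
            doc.put("visitors", 42);
            // store it in the index that Kibana will later query
            client.index(new IndexRequest("flink-streams").source(doc), RequestOptions.DEFAULT);
        }
    }
}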
1.10 Kibana
Kibana is an open-source data visualization and exploration tool used for log
and time-series analytics, application monitoring, and operational intelligence
use cases. It offers powerful and easy-to-use features such as histograms, line
graphs, pie charts, heat maps, and built-in geospatial support.
Furthermore, we can set a time filter that displays only the data within a specified time range for time-based events.
Here, Kibana will be integrated with Elasticsearch to create data visualization dashboards from the data stored in the indexes. This analytical platform will help generate insights about a company's business operations.
1.11 Data processing architecture
In this project we opt for a very common approach for ingesting and analyzing huge volumes of streaming data, known as the Kappa architecture. It is based on a streaming design in which an incoming series of data is first stored in a messaging engine like Apache Kafka. From there, a stream processing engine reads the data, transforms it into an analyzable format, and then stores it somewhere for end users to query.
It has four main principles: data is immutable, everything is a stream, a single stream engine is used, and data can be replayed. It is implemented as follows:
Chapter 4 Implementation
Despite the fact that retailers have been using data analytics to develop business intelligence for years, the extreme complexity of today's data requires new approaches and tools. This is because the retail industry has entered the big data era and therefore has access to a huge amount of information that can be used to optimize the shopping experience and forge tighter relationships between customers, brands, and retailers.
This chapter is devoted to describing the steps followed to build the working environment, generate the test data, create the Flink queries and provide business insights to support retail decision making.
1.12 Environment Setup
In this project, our pipeline is divided into three main blocks: data generation, data processing and data visualization, which means we need to take into consideration the resource allocation that each service will need. The best choice here is to opt for a container-based technology such as Docker, which supports multiple workloads while using the same OS. To set up our work environment, we will use docker-compose to define and run the multi-container services; our environment is therefore defined as follows:
Inside this Docker Compose file, we need to define the properties needed for each service, such as the port number, the links with other containers, and the environment variables. The rest of the file is placed in Annex 1.
In this cluster, we will work with only one broker and one ZooKeeper node, given the limited resources we have; the same goes for the Flink task managers. Despite that, we will try to implement some Flink fault-tolerance mechanisms, such as state and checkpoints, which enable programs to recover in case of failures.
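As a hedged sketch of what enabling that mechanism looks like in a Flink program (the interval and mode below are chosen arbitrarily for illustration and are not taken from the thesis code):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // snapshot all operator state every 10 seconds so the job can be restored
        // from the last successful checkpoint after a failure
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);
        env.fromElements(1, 2, 3).print();   // trivial pipeline, just so the job runs
        env.execute("checkpointed-sketch");
    }
}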
To develop the Flink queries we will use IntelliJ IDEA along with the Maven build tool to build and manage the project. We also need to populate the POM file with the required dependencies for Flink, Kafka and Elasticsearch.
After setting up the cluster environment on Docker, we need to make sure that
the containers are defined inside the same docker network, in order to
guarantee the communication among containers.
1.13 Data Generation
The first step in designing any end-to-end pipeline is to make sure we have a plentiful amount of data to test the different possible scenarios of the desired application. There are many data generation tools available on the Internet that allow us to create datasets in a few clicks and then export them to a number of formats, including CSV, JSON, XML and even SQL. The problem with this type of approach is not only the fact that it does not work well for producing records with complex data types (e.g. records with multiple fields, or randomized data that maintains an order) but also that it is not very realistic. With this in mind, we need an approach that allows us to generate data in real time, which is why we chose the Kafka Connect Datagen connector: it produces realistic, customized data using our own schema specifications with our own fields.
To define the data schema for the record values, we will work with the Apache Avro format to describe the fields, the interval at which data is produced and the number of iterations. In view of this, we will define two Avro schemas: the first will be responsible for generating data for multiple mall sensors (BLE, Wi-Fi location tracking, etc.), and the second will produce purchase data that would normally be sent by the POS (point of sale) systems of the different shops.
1.13.1 Mall Sensor Data
First, we need to define the schema that will be responsible for generating sensor data for multiple zones inside the mall: Food Court, Fashion, Home Decoration and Electronics. The fields are defined as follows:
Sensor UID: UID of the sensor sending the customer's data.
MAC address: of the customer's phone detected by the sensor.
IP address: of the customer, for instance if he used the Wi-Fi of the mall or of any shop.
Longitude, Latitude: to locate the customer inside the mall; the coordinate values correspond to El Corte Inglés in Plaza de Callao, Madrid.
Timestamp: of the sensor reading.
The entire schema file is placed in Annex 2 and needs to be moved inside the kafka-connect container in order to be accessible to the Datagen connector.
Secondly, we need to specify the connector parameters, as shown in figure 16.
Figure 16 : Configuration of Datagen Connector
The same schema will be used to produce four streams of data, Zone1, Zone2, Zone3 and Zone4, which will result in four Datagen connectors.
After starting the four connectors, we can verify the incoming messages in Kafka:
1.13.2 Purchase Data
For generating purchase data, the schema will be defined by the following eight fields (Annex 2):
Seller Name: name of the worker who completed the purchase.
Product ID: reference of the sold product inside the store.
Product Category: one of the store categories: Accessories, Shoes & Bags, Women, Kids, Men and Home Care.
Items Available in stock: number of items left in stock for a specific product.
Payment method: the typical payment methods used are Cash, Mobile Payment and Credit Card.
Amount: price paid during the purchase.
Loyalty Card: specifies whether the customer belongs to a loyalty program.
Timestamp: of the purchase, provided by the POS.
For this schema, we will settle for only two streams, coming from POS1 and POS2. The same connector configuration applied for the sensor data will be used for these two connectors, modifying only the file path, the Kafka topic and the number of iterations, which is reduced to 100.
In Kafka, we can see the incoming messages:
Now that we have our real-time data at hand, we can move forward in the project and start developing our Flink queries.
1.14 Data Processing
The metrics built on top of this data should help increase the sales opportunity and measure the performance of our mall and its stores; therefore, in this project we selected the following core metrics for evaluating the mall and store performance:
Case study 1: Mall Analytics
- Mall Foot Traffic.
- In-Mall Proximity Traffic.
- Location Marketing.
Case study 2: In-Store Optimization
- Define Under-performing Categories.
- Customer Payment Preference.
- Inventory Checking.
- Compare sales potential.
Before jumping into the Flink code, it is important to note that the structure of each query will be as follows:
– Read the stream of events from a Kafka topic.
– Perform transformations on the incoming messages.
– Write the results of the operations to Elasticsearch.
A compact sketch of this three-step structure is given right below; the actual queries are then presented one by one in the following sections.
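The sketch compresses the read, transform and write steps into one self-contained program; it is an illustration under assumptions (plain string messages instead of the Avro records, and host, topic and index names borrowed from the docker-compose file of Annex 1), not the thesis code itself:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.http.HttpHost;
import org.elasticsearch.client.Requests;

public class PipelineSkeleton {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 1. read the stream of events from a Kafka topic (plain strings here, Avro in the real queries)
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "pipeline-skeleton");
        DataStream<String> events =
                env.addSource(new FlinkKafkaConsumer<>("1", new SimpleStringSchema(), props));

        // 2. perform a transformation on the incoming messages
        DataStream<String> transformed = events.map(value -> "processed:" + value);

        // 3. write the results of the operations to Elasticsearch
        ElasticsearchSink.Builder<String> esSinkBuilder = new ElasticsearchSink.Builder<>(
                Collections.singletonList(new HttpHost("elasticsearch", 9200, "http")),
                new ElasticsearchSinkFunction<String>() {
                    @Override
                    public void process(String element, RuntimeContext ctx, RequestIndexer indexer) {
                        Map<String, Object> json = new HashMap<>();
                        json.put("data", element);
                        indexer.add(Requests.indexRequest().index("flink-streams").source(json));
                    }
                });
        esSinkBuilder.setBulkFlushMaxActions(1);
        transformed.addSink(esSinkBuilder.build());

        env.execute("pipeline-skeleton");
    }
}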
1.14.1 Study Case 1: Mall Analytics
1.14.1.1 Mall Foot Traffic
Mall Foot Traffic is a key indicator for both shopping center owners and stores: it gives an idea of the sales opportunity represented by all visitors, and it is calculated as the sum of all incoming shoppers over a period of time. To translate this into a Flink program, we proceed as follows:
– We start the program by reading the required Kafka parameters props that the Flink consumer will use to pull messages from the topic "1" in Kafka. We also need to pass the deserialization schema, as well as the path to the Schema Registry "http://schema-registry:8081", to the FlinkKafkaConsumer<>, since we have generated the events using our own customized Avro schema that resides in the Confluent Schema Registry.
– In the same way, we will create four Flink DataStreams out of four Kafka topics, respectively for Zone1, Zone2, Zone3 and Zone4.
– Likewise, we will define a map function that takes the sensor data as input and returns a Tuple3<String, Long, Integer> (value.getSensorUID(), value.getTimestamp(), 1), where the 1 at the end will act as a counter. The mapped stream will then be continuously updated with new tuples as soon as new records are read.
– Next, we need to perform a count aggregation for the mall traffic, and an efficient way to do it is to use a ReduceFunction that sums the ones provided by the mapped streams.
– Now that we have the count of visitors for each zone, we will move forward and join the resulting streams in order to produce a joinedStream that returns the mall traffic for the whole shopping center.
– Finally, we will create an Elasticsearch sink that will be responsible for sending the results to an Elasticsearch index flink-streams, which in turn will be used by Kibana to deliver quick insights.
The corresponding Flink code is shown below:
// (excerpt: the StreamExecutionEnvironment `env`, the ParameterTool `parameterTool`
// and the Kafka consumer properties `props` are created earlier in the program)
env.getConfig().setGlobalJobParameters(parameterTool);
// we set the time characteristic to include a processing time window
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);

//Zone2~Food Court
DataStream<Zone2_Sensors> zone2 = env.addSource(new FlinkKafkaConsumer<>(
    "2", ConfluentRegistryAvroDeserializationSchema.forSpecific(
        Zone2_Sensors.class, "http://schema-registry:8081"), props));

// map each Zone2 record to (sensorUID, timestamp, 1); the trailing 1 acts as a counter
@Override
public Tuple3<String, Long, Integer> map(Zone2_Sensors value) throws Exception {
    return new Tuple3<String, Long, Integer>(value.getSensorUID(), value.getTimestamp(), 1);
}});

// sum the counters produced by the mapped stream
@Override
public Tuple3<String, Long, Integer> reduce(Tuple3<String, Long, Integer> value1,
        Tuple3<String, Long, Integer> value2) throws Exception {
    return new Tuple3<String, Long, Integer>(value1.f0, value1.f1, value1.f2 + value2.f2);
}});
//Zone3~Home Decoration
DataStream<Zone3_Sensors> zone3 = env.addSource(new FlinkKafkaConsumer<>(
    "3", ConfluentRegistryAvroDeserializationSchema.forSpecific(
        Zone3_Sensors.class, "http://schema-registry:8081"), props));

@Override
public Tuple3<String, Long, Integer> map(Zone3_Sensors value) throws Exception {
    return new Tuple3<String, Long, Integer>(value.getSensorUID(), value.getTimestamp(), 1);
}});

@Override
public Tuple3<String, Long, Integer> reduce(Tuple3<String, Long, Integer> value1,
        Tuple3<String, Long, Integer> value2) throws Exception {
    return new Tuple3<String, Long, Integer>(value1.f0, value1.f1, value1.f2 + value2.f2);
}});
// join the Zone1 and Zone2 counts on their timestamp field
DataStream<Tuple3<Long,Integer,Integer>> firstStream =
resultzone1.join(resultzone2).where(
new KeySelector<Tuple3<String,Long,Integer>, Long>() {
@Override
public Long getKey(Tuple3<String,Long,Integer> value) throws Exception {
return value.f1;
}}
).equalTo(
new KeySelector<Tuple3<String,Long,Integer>, Long>() {
@Override
public Long getKey(Tuple3<String,Long,Integer> value) throws Exception {
return value.f1;
}}
).window(TumblingProcessingTimeWindows.of(Time.minutes(20)))
.apply(new JoinFunction<Tuple3<String,Long,Integer>,
Tuple3<String,Long,Integer>, Tuple3<Long,Integer,Integer>>()
{
@Override
public Tuple3<Long, Integer, Integer> join(Tuple3<String, Long, Integer>
zone1, Tuple3<String, Long, Integer> zone2) throws Exception {
return new Tuple3<Long, Integer, Integer>(zone1.f1,zone1.f2,zone2.f2);
}});
DataStream<Tuple4<Long,Integer,Integer,Integer>> joinedStream =
firstStream.join(resultzone3).where(
new KeySelector<Tuple3<Long,Integer,Integer>, Long>() {
@Override
public Long getKey(Tuple3<Long,Integer,Integer> value) throws Exception {
return value.f0;
}}
).equalTo(
new KeySelector<Tuple3<String,Long,Integer>, Long>() {
@Override
public Long getKey(Tuple3<String,Long,Integer> value) throws Exception {
return value.f1;
}}
).window(TumblingProcessingTimeWindows.of(Time.minutes(20)))
.apply(new JoinFunction<Tuple3<Long,Integer,Integer>,
Tuple3<String,Long,Integer>, Tuple4<Long,Integer,Integer,Integer>>()
{
@Override
public Tuple4<Long,Integer,Integer,Integer> join(Tuple3<Long,Integer,Integer>
zone1_2, Tuple3<String, Long, Integer> zone3) throws Exception {
return new
Tuple4<Long,Integer,Integer,Integer>(zone1_2.f0,zone1_2.f1,zone1_2.f2,zone3.f2
);
}});
ElasticsearchSink.Builder<Tuple4<Long,Integer,Integer,Integer>> esSinkBuilder
    = new ElasticsearchSink.Builder<>(httpHosts, new ElasticsearchSinkFunction
        <Tuple4<Long,Integer,Integer,Integer>>() {

    // build the JSON document for one result tuple
    // (the field names below are assumptions; the original listing does not show them)
    public IndexRequest createIndexRequest(Tuple4<Long,Integer,Integer,Integer> element) {
        Map<String, Object> json = new HashMap<>();
        json.put("Timestamp", element.f0);
        json.put("Zone1_Count", element.f1);
        json.put("Zone2_Count", element.f2);
        json.put("Zone3_Count", element.f3);
        return Requests.indexRequest()
                .index("flink-streams")
                .type("flink-stream")
                .source(json);
    }

    @Override
    public void process(Tuple4<Long,Integer,Integer,Integer> element,
            RuntimeContext ctx, RequestIndexer indexer) {
        indexer.add(createIndexRequest(element));
    }}
);
esSinkBuilder.setBulkFlushMaxActions(1);
joinedStream.addSink(esSinkBuilder.build());
env.execute();
}
}
1.14.1.2 In-Mall Proximity Traffic
While Mall Foot Traffic quantifies the number of visitors during a regular period of time, In-Mall Proximity Traffic sums the number of visitors per zone; that is to say, it will give the owners an idea of the traffic flow around the mall. The steps to develop the corresponding Flink code are the following:
– For this query, instead of using a ReduceFunction, we will opt for a ProcessWindowFunction that iterates over the sensor stream and maintains the count of elements in a TumblingProcessingTimeWindows.of(Time.minutes(30)) window.
– Inside this process function, we will filter the records as they enter, using the longitude and latitude coordinates, in order to get the counts per area inside a specific zone; here we take the example of three areas that may exist inside the fashion zone (Zone1).
– As a result, we will have a Tuple2<String, Integer>(area, count) containing the name of each area along with the number of visitors at the same time.
The Flink code is described below.
public class FlinkQuery2 {
public static void main(String[] args) throws Exception {
    // (excerpt: environment, Kafka source and stream definitions as in the previous query)
    env.getConfig().setGlobalJobParameters(parameterTool);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    // use the sensor timestamp as the event time of each record
    @Override
    public long extractAscendingTimestamp(Zone1_Sensors element) {
        return element.getTimestamp();
    }
    }));

    // count the visitors per area of Zone1 inside each window
    @Override
    public void process(Tuple tuple, Context context, Iterable<Zone1_Sensors> input,
            Collector<Tuple2<String,Integer>> out) throws Exception {
        int count = 0;
        String area = new String();
        for (Zone1_Sensors in : input) {
            // areas are told apart by their longitude range
            if (in.getLongitude() <= -3.705381 && in.getLongitude() > -3.705689) {
                count++;
                area = "Zone1-Area2-H&M";
            }}
1.14.1.3 Location Marketing
Now that we have the number of visitors per zone, instead of just using it to identify traffic patterns inside the mall, we can also use it to prevent traffic jams inside the shopping center; to put it another way, we will control the visitor traffic by sending personalized promotions to individuals based on their location.
To do so, we will use the previous proximitytraffic stream as follows:
– The goal here is to control the visitor traffic. We start by iterating over the previous proximitytraffic stream, and whenever we find a number of visitors that exceeds 30 people in some area, we send a personalized promotion as a notification in the customer's app.
– Normally, the results of the trafficjam stream should be pushed to a Kafka topic so that they can be pulled by the user's app to generate recommendations and advertisements, but since we do not have a mall application, we instead used Textlocal's messaging API to send an SMS with the ad "Up to 70% Off on Kids Wear at H&M".
– In the end, we will also push the results of these two queries to Elasticsearch, as the proximity-traffic and traffic-jam indexes.
The Flink code is below:
// Send an advertisement to the customers with loyalty programs or that have the Mall application installed
@Override
public void process(Tuple tuple, Context context, Iterable<Tuple2<String, Integer>> input,
        Collector<Tuple3<Integer, String, String>> out) throws Exception {
    // (excerpt: a Tuple3 accumulator `traffic` and a loop over `input` binding each element
    //  to `in` precede this point)
    if (in.f1 >= 25) {
        traffic.f0 = in.f1;
        traffic.f1 = in.f0;
        // Construct data
        String apiKey = "apikey=" + "****************";
        String message = "&message=" + "Up to 70% Off on Kids Wear at H&M";
        String sender = "&sender=" + "Mall Discounts";
        String numbers = "&numbers=" + "********";
        // Send data (the request string `data` concatenates the parameters above;
        // `stringBuffer` collects the API response)
        HttpURLConnection conn = (HttpURLConnection) new
                URL("https://api.txtlocal.com/send/?").openConnection();
        conn.getOutputStream().write(data.getBytes("UTF-8"));
        final BufferedReader rd = new BufferedReader(new
                InputStreamReader(conn.getInputStream()));
        traffic.f2 = stringBuffer.toString();
    }
    else {
        traffic.f0 = in.f1;
        traffic.f1 = in.f0;
        traffic.f2 = "No sign of Traffic Jam at this moment";
    }}
    out.collect(new Tuple3<Integer, String, String>(traffic.f0, traffic.f1, traffic.f2));
}
});
// Elasticsearch sink configuration (one sink per result stream)
    return Requests.indexRequest()
            .index("proximity-traffic")
            .type("proximity-traffic")
            .source(json);
}
@Override
public void process(Tuple2<String,Integer> element, RuntimeContext ctx,
        RequestIndexer indexer) {
    indexer.add(createIndexRequest(element));
}}
);
esSinkBuilder.setBulkFlushMaxActions(1);
proximityTraffic.addSink(esSinkBuilder.build());

    return Requests.indexRequest()
            .index("traffic-jam")
            .type("traffic-jam")
            .source(json);
}
@Override
public void process(Tuple3<Integer,String,String> element, RuntimeContext ctx,
        RequestIndexer indexer) {
    indexer.add(createIndexRequest(element));
}});
1.14.2 Study Case 2: In-Store Optimization
1.14.2.1 Define Under-performing Categories
In this section, as for the next queries, we will rather focus on optimizing the in-store activities. As a first step, we will develop a query that helps us detect the under-performing categories (Accessories, Shoes & Bags, Women, Kids, Men and Home Care) in a store, and therefore find a way to boost their sales without impacting the profits.
To do so, we will make use of the POS streams:
– For this use case, we read the purchase records from the POS1 and POS2 topics, after which we keyBy("Product_Category"), so that we can maintain the count of items per key.
– Then, we count the number of products sold inside a window of 15 minutes; as a result, the ProcessWindowFunction returns a tuple containing the category of the product, the date of purchase and an updated counter of how many items from that category were sold so far.
– These results will then be sent to the product-categories Elasticsearch index.
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);

// count the items sold per product category inside each window
@Override
public void process(Tuple tuple, Context context, Iterable<PurchaseData> input,
        Collector<Tuple3<String, String, Integer>> out) throws Exception {
    int count = 0;
    String category = new String();
    String date = new String();

ElasticsearchSink.Builder<Tuple3<String,String,Integer>> esSinkBuilder1 = new
        ElasticsearchSink.Builder<>(httpHosts, new ElasticsearchSinkFunction
        <Tuple3<String,String,Integer>>() {
    return Requests.indexRequest()
            .index("product-categories")
            .type("product-categories")
            .source(json);
}
@Override
public void process(Tuple3<String,String,Integer> element,
        RuntimeContext ctx, RequestIndexer indexer) {
    indexer.add(createIndexRequest(element));
}}
);
esSinkBuilder1.setBulkFlushMaxActions(1);
categoryCount.addSink(esSinkBuilder1.build());
1.14.2.2 Customer Payment Preference
The goal of this query, as with every in-store optimization method, is to improve the customer experience by detecting the preferences of the visitors and accordingly reviewing the flexibility of the POS system, with the aim of creating new opportunities to deepen the relationship with the customers.
Here we have the same flow of functions used in the previous query, only with a different output, Tuple2<String, Integer>(payment, count), which will be pushed to a different index, store-payment.
The implementation is as follows:
@Override
public void process(Tuple tuple, Context context, Iterable<PurchaseData> input,
        Collector<Tuple2<String, Integer>> out) throws Exception {
    int count = 0;
    String payment = new String();
    for (PurchaseData in : input) {
        count++;
        payment = in.getPaymentMethod();
    }
    out.collect(new Tuple2<String,Integer>(payment, count));
}
});

    return Requests.indexRequest()
            .index("store-payment")
            .type("store-payment")
            .source(json);
}
@Override
public void process(Tuple2<String,Integer> element, RuntimeContext ctx,
        RequestIndexer indexer) {
    indexer.add(createIndexRequest(element));
}}
);
esSinkBuilder2.setBulkFlushMaxActions(1);
paymentMethod.addSink(esSinkBuilder2.build());
1.14.2.3 Inventory Checking
This query keeps track of the number of items left in stock for each product, so that the analytics team can be warned before a product runs out of stock.
The program is defined below:
// Inventory Checking
DataStream<Tuple3<Long,Integer,String>> stockAvailability = purchaseStream
    .keyBy("productid")
    .timeWindow(Time.minutes(15))
    .process(new org.apache.flink.streaming.api.functions.windowing
            .ProcessWindowFunction<PurchaseData, Tuple3<Long, Integer, String>,
            Tuple, TimeWindow>() {
    @Override
    public void process(Tuple tuple, Context context, Iterable<PurchaseData> elements,
            Collector<Tuple3<Long, Integer, String>> out) throws Exception {

    return Requests.indexRequest()
            .index("stock-availability")
            .type("stock-availab")
            .source(json);
}
@Override
public void process(Tuple3<Long,Integer,String> element,
        RuntimeContext ctx, RequestIndexer indexer) {
    indexer.add(createIndexRequest(element));
}});

esSinkBuilder3.setBulkFlushMaxActions(1);
stockAvailability.addSink(esSinkBuilder3.build());
1.14.2.4 Compare sales potential
Finally, we will construct a metric that measures the store's daily sales, with the aim of creating a sales report for each month. In this query, the transformation is simpler: it consists of getting, from the POS stream, the amount paid by shoppers in each purchase along with the date, and then storing the aggregated results in the Elasticsearch store-sales index.
The Flink code is the following:
@Override
public void process(Tuple tuple, Context context, Iterable<PurchaseData> elements,
        Collector<Tuple2<String, Integer>> out) throws Exception {
    int totalSales = 0;
    String date = new String();

    return Requests.indexRequest()
            .index("store-sales")
            .type("store-sale")
            .source(json);
}
@Override
public void process(Tuple2<String,Integer> element, RuntimeContext ctx,
        RequestIndexer indexer) {
    indexer.add(createIndexRequest(element));
}});

esSinkBuilder4.setBulkFlushMaxActions(1);
totalSales.addSink(esSinkBuilder4.build());
env.execute();
These metrics will help the owners quantify the customer's journey inside the mall, determine a particular customer's purchase pattern, and therefore increase the sales. Nevertheless, there is a practically unlimited number of metrics that could be deployed and others that could be improved; to choose among them, the problem at hand needs to be clearly defined from the beginning in order to determine the adequate metrics that can solve it.
1.15 Data Visualization and Distribution
Now that we have the results of the queries stored in Elasticsearch indexes, the analytics team can use this information to monitor the retail metrics. In this project, we used Kibana's visualization features to demonstrate the data exploration.
This section illustrates the visualizations created for the dashboard from the Elasticsearch indices. The first visualization created was a stacked line chart for monitoring the Mall Foot Traffic inside the shopping center:
This chart will give the owners an idea of how the flow of visitors changes across the different zones of the mall over time, and it will help them detect in which periods visitors tend to shop. For example, at timestamp 1583026800, which corresponds to 1/3/2020 at 1:40:00, we can clearly see a spike that may be explained by a discount period or some kind of event inside the mall.
Another chart for in-mall analytics would be a heat map that will be updated
with the number of visitors present in each area:
Figure 21 : Stacked Pie chart for Traffic Control
This visualization gives a clear view of the number of visitors present in each area, along with the possibility of a traffic jam forming in front of a store. If the number of people exceeds 30, the mall app will trigger a push notification to the visitors in that area (the ones having the app installed on their phones) with recommendations and discounts available in other stores; this way we can control the traffic flow and prevent theft attempts.
Now, for the in-store metrics, we will start by creating a horizontal bar chart to represent the number of items sold by each store category, as follows:
For this period of time, we clearly see that there is only a small difference in the number of items sold for each category. This would help the retailers think of new strategies to boost the sales in the underperforming categories, such as: buy one item from category X and get 50% off an item in category Y.
Figure 24 : Pie chart for stock control
The main goal behind this metric is to prevent lost sales by monitoring the stock and having up-to-date information about the store's health. We can use this metric not only to help plan new inventory purchases, but also to estimate product demand and detect the best-selling products. In our case, it will trigger a notification to inform the analytics team whenever inventory levels are out of sync with demand, giving retailers enough time to increase their orders before a product goes viral (in the query we set a threshold of 30 items per product).
And last but not least, we will create an area chart to quantify sales growth:
This metric will be used to build sales reports; they will include the total sales revenue, the profits and an analysis of sales growth. This will show how efficiently the company is making money, how the business profitability is changing, and whether the current business strategy is effective or needs restructuring.
We can see in the chart that, within the same day, the total sales may increase or decrease; that is because we took into consideration the items that customers might take and then return for some reason, which results in a sales decrease.
To summarize this part of the thesis, whenever a company wants to perform any kind of visualization, it must first decide the goal or the value behind it; only then can the generated insights help create more effective retail optimization solutions.
Chapter 5 Conclusion
The goal of this project was to develop a stream processing pipeline able to examine individual shopping behavior, with the aim of helping mall managers and store owners react quickly to discovered insights while they still count. Our architecture brings together stream processing capabilities, real-time analytics and visual dashboards in one solution.
Chapter 3 focused on exploring the main components used to build our system architecture and how they interact with each other. For generating the sensor and point-of-sale data, Kafka Connect Datagen was used along with the Confluent Schema Registry to produce Avro messages; those records are then sent to Kafka topics so they can be pulled by the stream processing framework, Flink, to be combined and analyzed. The results of the Flink transformations are then sent to Elasticsearch, a search engine that stores them in indexes so that Kibana, the visualization and exploration tool, can easily query and analyze them to deliver quick insights.
Chapter 4 then developed two study cases. The first is Mall Analytics, where the Mall Foot Traffic metric helps the managers track the visitor patterns inside the mall and therefore gain better visibility of traffic jams and of the most used routes inside it. The second is In-Store Optimization: in this part we developed Flink programs that can help the store owners investigate the shopper experience with the aim of addressing their needs and expectations. In detail, the first program helps identify the underperforming product categories inside the store (accessories, kids, women, and so on); this way the owners will be aware of which areas need to be optimized and which products are not selling. The second one helps with monitoring the inventory levels, to make sure products are never out of stock, especially in today's market where customers have many choices in front of them, most of them just a click away. The third program tries to detect the customer payment preferences, so that retailers can adapt offers to them and opt for up-to-date point-of-sale systems; this is part of defining a customer's profile to provide a more personalized experience. The goal of the last metric is to keep a history of the store's daily sales in order to construct sales reports and have visibility of the effectiveness of the current business strategies. All these results are stored in Elasticsearch indexes and updated after each computation; Kibana then connects to those indexes and uses the stored data to generate visualization reports, dynamic dashboards and alerts.
All things considered, this project could be expanded by adding another storage layer to our system architecture, such as HBase or Cassandra, where results can be stored for deeper analysis, and also by implementing machine learning programs that can predict trends, customer behavior and inventory needs (determining when it is time to reorder, and knowing when a shortage might occur), and therefore make better decisions.
3 Annexes
Annex 1
Docker-compose.yaml
---
version: '2'
services:
  zookeeper:
    image: wurstmeister/zookeeper:latest
    container_name: zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_TICK_TIME: 2000
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
  kafka:
    image: wurstmeister/kafka:latest
    container_name: kafka
    ports:
      - "9092:9092"
      - "9093:9093"
    depends_on:
      - zookeeper
    expose:
      - "9093"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENERS: INSIDE://0.0.0.0:9092,OUTSIDE://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:9092,OUTSIDE://localhost:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"
      KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  schema-registry:
    image: confluentinc/cp-schema-registry:5.0.0
    container_name: schema-registry
    depends_on:
      - zookeeper
      - kafka
    ports:
      - '8081:8081'
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: PLAINTEXT://kafka:9092
  connect-1:
    image: confluentinc/kafka-connect-datagen:0.3.0
    build:
      context: .
      dockerfile: Dockerfile-confluenthub
    container_name: connect-1
    restart: always
    ports:
      - "8083:8083"
    depends_on:
      - zookeeper
      - kafka
      - schema-registry
    environment:
      CONNECT_BOOTSTRAP_SERVERS: kafka:9092
      CONNECT_REST_PORT: 8083
      CONNECT_HOST_NAME: connect-1
      CONNECT_GROUP_ID: kafka-connect-sensordata
      CONNECT_REST_ADVERTISED_HOST_NAME: connect-1
      CONNECT_CONFIG_STORAGE_TOPIC: docker-connect-configs
      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_OFFSET_FLUSH_INTERVAL_MS: 10000
      CONNECT_OFFSET_STORAGE_TOPIC: docker-connect-offsets
      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_STATUS_STORAGE_TOPIC: docker-connect-status
      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_KEY_CONVERTER: io.confluent.connect.avro.AvroConverter
      CONNECT_VALUE_CONVERTER: io.confluent.connect.avro.AvroConverter
      CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: 'http://schema-registry:8081'
      CONNECT_KEY_CONVERTER_SCHEMA_REGISTRY_URL: 'http://schema-registry:8081'
      CONNECT_INTERNAL_KEY_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_INTERNAL_VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_PRODUCER_INTERCEPTOR_CLASSES: "io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor"
      CONNECT_CONSUMER_INTERCEPTOR_CLASSES: "io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor"
      CONNECT_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      CONNECT_PLUGIN_PATH: "/usr/share/java,/usr/share/confluent-hub-components"
      CONNECT_LOG4J_LOGGERS: "org.apache.kafka.connect.runtime.rest=WARN,org.reflections=ERROR"
  kafka-topics-ui:
    image: landoop/kafka-topics-ui:latest
    depends_on:
      - zookeeper
      - kafka
      - rest-proxy
    ports:
      - "8000:8000"
    environment:
      KAFKA_REST_PROXY_URL: 'rest-proxy:8082'
      PROXY: "true"
  rest-proxy:
    image: confluentinc/cp-kafka-rest:5.3.1
    container_name: rest-proxy
    depends_on:
      - zookeeper
      - kafka
      - schema-registry
    ports:
      - "8082:8082"
    environment:
      KAFKA_REST_HOST_NAME: rest-proxy
      KAFKA_REST_BOOTSTRAP_SERVERS: PLAINTEXT://kafka:9092
      KAFKA_REST_SCHEMA_REGISTRY_URL: 'http://schema-registry:8081'
  kafka-connect-ui:
    image: landoop/kafka-connect-ui:latest
    depends_on:
      - zookeeper
      - kafka
      - connect-1
    ports:
      - "8001:8000"
    environment:
      CONNECT_URL: 'connect-1:8083'
  jobmanager:
    image: flink:latest
    expose:
      - "6123"
    ports:
      - "8089:8089"
    command: jobmanager
    links:
      - zookeeper
      - kafka
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
  taskmanager:
    image: flink:latest
    expose:
      - "6121"
      - "6122"
    depends_on:
      - jobmanager
    command: taskmanager
    links:
      - "jobmanager:jobmanager"
      - kafka
      - zookeeper
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
  kibana:
    image: docker.elastic.co/kibana/kibana:7.7.1
    restart: always
    ports:
      - "5601:5601"
    environment:
      ELASTICSEARCH_URL: http://172.18.0.9:9200
      ELASTICSEARCH_HOSTS: http://172.18.0.9:9200
    depends_on:
      - elasticsearch
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.7.1
    environment:
      - xpack.security.enabled=false
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - cluster.routing.allocation.disk.threshold_enabled=false
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - "9200:9200"
    expose:
      - "9200"
Annex 2
{ "namespace": "com.SensorData",
"name": "Zone1_Sensors",
"type": "record",
"fields": [
{"name":"sensor_UID", "type": {
"type":"string",
"arg.properties":{
"options": [ "e74e8480-42da-46c1-b197-584e6030120c",
"e4b9e4b4-4713-419d-80cf-17a21c48d5da",
"90330f07-b71b-4e60-9b16-cd11479f78f7",
"d0b5a283-9bb9-46f6-9a55-7f13fd280481",
"88b63897-b90e-44cb-a455-7b67ff051dc0",
"7689b71e-9899-4e70-a1c7-a019aef39f9e",
"dbea6409-289e-4bba-8866-78a967980350",
"84e05ffe-4f5f-4e72-a5ed-857bb7a90002",
"542a5ea7-6312-4142-9682-b128c8ef444d",
"140faa0c-4be8-4a96-8e51-aca4fe0541f4"
]
}
}},
{"name": "MAC", "type": {
    "type": "string",
    "arg.properties": {
        "options": [ "CE-F2-94-BD-00-B7", "97-AF-38-A3-E9-C7", "7E-4C-CF-8C-76-B6",
            "37-1C-0D-74-5F-59", "AE-1B-41-96-E4-E6", "8A-0D-32-D1-69-4E", "85-36-04-91-79-90",
            "15-08-B3-41-77-30", "99-E9-FD-05-84-DB", "C3-F4-4C-30-15-D8", "46-22-18-F1-0F-FC",
            "8E-F7-57-A5-83-BE", "D7-D0-57-41-A8-D3", "97-28-40-58-BB-24", "AF-17-37-25-07-66",
            "0E-A7-86-16-8A-4C", "E0-8A-AA-4C-8F-DD", "47-D9-45-1B-91-BD", "10-53-2F-17-95-67",
            "C2-2E-40-46-E1-4D" ]
    }
}},
{"name": "ip", "type": {
"session": "true",
"type":"string",
"arg.properties": {
"options":["111.152.45.45",
"111.203.236.146",
"111.168.57.122",
"111.249.79.93",
"111.168.57.122",
"111.90.225.227",
"111.173.165.103",
"111.145.8.144",
"111.245.174.248",
"111.245.174.111",
"222.152.45.45",
"222.203.236.146",
"222.168.57.122",
"222.249.79.93",
"222.168.57.122",
"222.90.225.227",
"222.173.165.103",
"222.145.8.144",
"222.245.174.248",
"222.245.174.222",
"122.152.45.245",
"122.203.236.246",
"122.168.57.222",
"122.249.79.233",
"122.168.57.222",
"122.90.225.227",
"122.173.165.203",
"122.145.8.244",
"122.245.174.248",
"122.245.174.122",
"233.152.245.45",
"233.203.236.146",
"233.168.257.122",
"233.249.279.93",
"233.168.257.122",
"233.90.225.227",
"233.173.215.103",
"233.145.28.144",
"233.245.174.248",
"233.245.174.233"
]
}
}
},
{"name": "latitude", "type": {
"type": "double",
"arg.properties": {
"range": {
"min": 40.419373,
"max": 40.419999
}
}
}},
{"name": "longitude", "type": {
"type":"double",
"arg.properties":{
"range":{
"min": -3.705999,
"max": -3.705075
}
}
}},
{
"name": "Timestamp","type": {
"type": "long",
"arg.properties":{
"iteration":{
"start" : 1583024400,
"restart": 1590973200,
"step": 300
}}
}
}
]
}
PurchaseData.avsc:
{ "namespace": "com.SensorData",
"name": "PurchaseData",
"type": "record",
"fields": [
{
"name": "Seller_Name","type": {
"type": "string",
"arg.properties":{
"options": [ "Seller_1", "Seller_2", "Seller_3", "Seller_4"
]}
}},
{"name": "productid",
"type": {
"type": "long",
"arg.properties": {
"iteration":{
"start" : 1,
"restart": 30
}}
}
},
{
"name": "Product_Category","type": {
"type": "string",
"arg.properties":{
"options": [ "Accessories", "Shoes & Bags", "Women", "Kids",
"Men", "Home Care"
]}
}},
{
"name": "Number_Available_stock",
"type": {
"type": "int",
"arg.properties": {
"range":{
"min": 0,
"max": 100
}
}
}},
{
"name": "Payment_Method","type": {
"type": "string",
"arg.properties":{
"options": [ "Cash", "Mobile Payment", "Credit Card"
]}
}},
{"name": "Amount",
"type": {
"type":"int",
"arg.properties":{
"range":{
"min": 3,
"max": 200
}
}
}},
{"name": "Loyalty_Card","type": {
"type": "string",
"arg.properties":{
"options": [ "true" , "false" ]
}}},
{
"name": "Timestamp","type": {
"type": "long",
"arg.properties":{
"iteration":{
"start" : 1583024400,
"restart": 1588294861,
"step": 86400
}}
}
}
]
}