Commercial Analytics
of Clickstream Data using
Hadoop
June 2014
Submitted by:
Kartik Gupta
201100048
M.C.A
Thapar University
Submitted to:
School of Mathematics and
Computer Application
Department,
Thapar University,
Patiala.
Outline
Overview
Big Data
Hadoop
Major Steps
Results and Analysis
Conclusion and Future Scope
Overview
This project produces an analytic report on the
behavior and location of website visitors using Hadoop.
MapReduce is implemented to refine and sort
the raw data.
Searching is done by country, IP address,
postal code, and category.
Hadoop is a tool that converts unstructured,
structured, and semi-structured data into
key-value pairs, which are represented in
binary format.
The MapReduce framework is used for parallel
implementation.
Big Data
Big Data is a term used to describe large
collections of data that may be unstructured
and that grow so large and so quickly that they
are difficult to manage with regular database or
statistical tools.
The 3 Vs of Big Data: Volume, Velocity, Variety
Hadoop
Open source project started by Doug Cutting
A platform to manage Big Data
Helps in Distributed computing
Runs on Commodity Hardware
Data storage (HDFS)
Runs on commodity hardware (usually Linux)
Horizontally scalable
Processing (MapReduce)
Parallelized (scalable) processing
Fault Tolerant
CORE PARTS OF HADOOP
Hadoop Distributed File System(HDFS)
Hadoop Distributed File System (HDFS) is a
Java-based file system that provides scalable
and reliable data storage that is designed to
span large clusters of commodity servers.
Some specific features ensure that
Hadoop clusters are highly functional:
Rack Awareness
Minimal Data Motion
Utilities
Rollback
Highly Operable
How HDFS works
MapReduce
MapReduce is a programming model and an
associated implementation for processing large
data sets.
MapReduce usually splits the input data-set into
independent chunks which are processed in a
completely parallel manner.
This allows programmers without any experience
with parallel and distributed systems to easily
utilize the resources of a large distributed system.
The run-time system takes care of scheduling
tasks, monitoring them, and re-executing failed
tasks.
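The map, group-by-key, and reduce phases described above can be illustrated with a minimal single-process sketch. This is not Hadoop code: the function names (map_country, reduce_count, run_mapreduce) and the "country" field are assumptions made for illustration, and a real cluster would run the map and reduce calls in parallel on separate nodes.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_country(record: Dict[str, str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit (country, 1) for one clickstream record."""
    yield record["country"], 1


def reduce_count(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum all counts emitted for one key."""
    return key, sum(values)


def run_mapreduce(
    records: Iterable[Dict[str, str]],
    mapper: Callable[[Dict[str, str]], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run map, shuffle (group values by key), and reduce in one process."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Keys are processed in sorted order, mirroring the sort before reduce.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [{"country": "usa"}, {"country": "ind"}, {"country": "usa"}]
    print(run_mapreduce(sample, map_country, reduce_count))
```

Because each record is mapped independently and each key is reduced independently, both phases can be split across machines without changing the result.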
Execution flow in MapReduce
1. The MapReduce program that has been written tells
the JobClient to run a MapReduce job.
2. The JobClient sends a message to the JobTracker, which
produces a unique ID for the job.
3. The JobClient copies job resources, such as the JAR file,
to the distributed file system.
4. Once the resources are in the distributed
file system, the JobClient tells the JobTracker to
start the job.
5. The JobTracker does its own initialization for the
job and retrieves the input splits from the
distributed file system.
6. Now that the JobTracker has work for the
TaskTrackers, it returns a map task or reduce
task in response to a heartbeat.
7. The TaskTrackers need to obtain the code to
execute, so they get it from the shared file system.
8. The TaskTracker now runs the task.
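The eight steps can be traced with a toy, single-process simulation. The names JobClient, JobTracker, and TaskTracker come from the slides; every method name here (new_job_id, start_job, heartbeat, poll_and_run, submit) is hypothetical and does not correspond to the real Hadoop API. The shared file system is assumed to be a plain dictionary.

```python
from collections import deque
from itertools import count


class JobTracker:
    """Toy stand-in for the JobTracker: issues job IDs and queues tasks."""

    def __init__(self):
        self._ids = count(1)
        self._pending = deque()

    def new_job_id(self):
        # Step 2: produce a unique ID for the job.
        return f"job_{next(self._ids):04d}"

    def start_job(self, job_id, input_splits):
        # Step 5: initialize the job, one map task per input split.
        for index, split in enumerate(input_splits):
            self._pending.append((job_id, f"map_{index}", split))

    def heartbeat(self, tracker_name):
        # Step 6: answer a heartbeat with a waiting task, or None.
        return self._pending.popleft() if self._pending else None


class TaskTracker:
    """Toy stand-in for a TaskTracker that asks for work and runs it."""

    def __init__(self, name, shared_fs):
        self.name = name
        self.shared_fs = shared_fs

    def poll_and_run(self, job_tracker):
        task = job_tracker.heartbeat(self.name)
        if task is None:
            return None
        job_id, task_id, split = task
        code = self.shared_fs[job_id]   # Step 7: fetch the code to execute.
        return task_id, code(split)     # Step 8: run the task.


def submit(job_tracker, shared_fs, code, input_splits):
    """Toy JobClient covering steps 1 to 4."""
    job_id = job_tracker.new_job_id()            # Steps 1 and 2
    shared_fs[job_id] = code                     # Step 3: copy job resources
    job_tracker.start_job(job_id, input_splits)  # Step 4: start the job
    return job_id
```

For example, submitting the built-in len as the job code with two splits yields two map tasks, which a TaskTracker then picks up one heartbeat at a time.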
OTHER TECHNOLOGICAL
TERMS
Clickstream Data
Clickstream data is an information trail a user leaves
behind while visiting a website. It is typically captured in
semi-structured website log files.
Potential Uses of Clickstream Data
What is the most efficient path for a site visitor to research
a product, and then buy it?
What products do visitors tend to buy together, and what
are they most likely to buy in the future?
Where should I spend resources on fixing or enhancing the
user experience on my website?
We focus on the path optimization use case.
Specifically: how can we improve our website to reduce
bounce rates and improve conversion?
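A minimal sketch of how the two metrics could be computed from page views. It assumes that a bounce is a session with exactly one page view and that a conversion is a session that reaches a checkout page; the session IDs, the "/checkout" path, and the function name session_metrics are all hypothetical and not taken from the Acme dataset.

```python
from collections import defaultdict


def session_metrics(page_views, conversion_url="/checkout"):
    """Return (bounce_rate, conversion_rate) for (session_id, url) pairs.

    Assumptions: a bounce is a session with exactly one page view, and a
    conversion is a session that visited conversion_url at least once.
    """
    pages = defaultdict(list)
    for session_id, url in page_views:
        pages[session_id].append(url)

    total = len(pages)
    if total == 0:
        return 0.0, 0.0

    bounces = sum(1 for urls in pages.values() if len(urls) == 1)
    conversions = sum(1 for urls in pages.values() if conversion_url in urls)
    return bounces / total, conversions / total
```

Tracking both numbers before and after a site change gives a simple way to judge whether a path optimization helped.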
STEP I
Upload the Acme website log dataset, which contains about
4 million rows representing five days of clickstream
data.
STEP II
Represent the dataset in unstructured format, i.e.
timestamp, registered user SWID, IP address,
geocoded IP address, URL
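A minimal sketch of turning one raw log line into the five fields listed in this step. The tab delimiter, the timestamp pattern, and the names LogRecord and parse_line are assumptions; the actual layout of the Acme log files is not given in the slides.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    timestamp: datetime
    swid: str   # registered user SWID
    ip: str     # IP address
    geo: str    # geocoded IP address, kept as raw text
    url: str


def parse_line(line: str, sep: str = "\t") -> Optional[LogRecord]:
    """Parse one log line; return None if the line is malformed.

    Assumes five sep-delimited columns and a 'YYYY-MM-DD HH:MM:SS' timestamp.
    """
    parts = line.rstrip("\n").split(sep)
    if len(parts) != 5:
        return None
    ts, swid, ip, geo, url = (part.strip() for part in parts)
    try:
        timestamp = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    return LogRecord(timestamp, swid, ip, geo, url)
```

Returning None for malformed lines lets a later refinement step count or drop bad rows instead of failing on the first one.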
STEP III
Represent the user data from the unstructured
loaded dataset
STEP IV
Represent the products category-wise from
the dataset
STEP V
Show the refined dataset of the Acme log files
STEP VI
Combine all the tables, i.e. Acme log, products, and users.
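A minimal sketch of combining the three tables in memory. The join keys are assumptions: log rows are matched to users on "swid" and to products on "url". The column names (gender, age, category) and the function name join_tables are hypothetical.

```python
def join_tables(logs, users, products):
    """Left-join log rows with user and product attributes.

    Assumed keys: logs.swid = users.swid and logs.url = products.url.
    Log rows without a match keep None for the missing attributes.
    """
    users_by_swid = {user["swid"]: user for user in users}
    category_by_url = {product["url"]: product["category"] for product in products}

    joined = []
    for row in logs:
        user = users_by_swid.get(row["swid"], {})
        joined.append({
            **row,
            "gender": user.get("gender"),
            "age": user.get("age"),
            "category": category_by_url.get(row["url"]),
        })
    return joined
```

A left join keeps every click, so visits by unregistered users or to uncategorized pages still appear in the combined table.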
Results and Analysis
Configuration of Hadoop
Count the number of visitors from any country
Retrieving the IP address and displaying the state of visitors
Showing the number of IP addresses accessing this category at a time
Initial stage of mapping and reduction
Total number of IP addresses that accessed each category
Showing the shoes category by state, with the total number of IP addresses that accessed it
Showing details of the IP addresses used by visitors, gender-wise
Number of female visitors who accessed this page
Total number of IP addresses that accessed a particular webpage
Calculate the sum of ages of all the visitors
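A minimal sketch of how a few of these analyses could be expressed over the combined table. The column names (ip, swid, url, category, gender, age), the "F" gender code, and the function names are assumptions; the queries actually used in the project are not shown in the slides.

```python
from collections import defaultdict


def distinct_ips_per_category(rows):
    """Number of distinct IP addresses that accessed each category."""
    ips = defaultdict(set)
    for row in rows:
        if row.get("category") is not None:
            ips[row["category"]].add(row["ip"])
    return {category: len(addresses) for category, addresses in ips.items()}


def female_visitors_for_url(rows, url):
    """Number of distinct female users (by SWID) who accessed one page."""
    return len({
        row["swid"]
        for row in rows
        if row["url"] == url and row.get("gender") == "F"
    })


def sum_of_ages(rows):
    """Sum of ages, counting each user (by SWID) once."""
    ages = {}
    for row in rows:
        if row.get("age") is not None:
            ages[row["swid"]] = row["age"]
    return sum(ages.values())
```

Counting distinct IP addresses or SWIDs rather than raw rows avoids inflating the totals when one visitor clicks the same page several times.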
Conclusion
The amount of clickstream data is growing
rapidly, and with it the demand for accessing
information over the web has increased
significantly.
It is therefore useful to analyze the behavior
and location of visitors.
It is inefficient to process large datasets using
traditional sequential methods.
Therefore MapReduce is used for processing
large datasets.
Future Scope
Clickstream information plays an important
role in a wide variety of applications, such as
decision support systems and profile-based
marketing.
Location search is used by various industries,
such as telecom and e-commerce, and in event
detection.
The nearest-location method can be fused with
any other method to support better
decision making.
A tradeoff would then be made between
distance and the other factor that is fused with it.
Thank you !!!