Clickstream Data

This document presents a project analyzing clickstream data from a commercial website using Hadoop. Hadoop is used to refine and sort raw clickstream data captured from the site. MapReduce is implemented for parallel processing. Analysis is performed to determine visitor behavior and location based on country, IP address, postal code and categories. Results show configuration of Hadoop, counts of visitors by country and IP addresses, category access by IPs, and age summaries of visitors. The large amount of clickstream data grows quickly, so Hadoop is useful for scalable processing.

Uploaded by

Kartik Gupta


Commercial Analytics of Clickstream Data using Hadoop

June 2014

Submitted by:
Kartik Gupta
201100048
M.C.A
Thapar University

Submitted to:
School of Mathematics and
Computer Application
Department,
Thapar University,
Patiala.

Outline
Overview
Big Data
Hadoop
Major Steps
Results and Analysis
Conclusion and Future Scope

Overview
This project produces an analytic report on the behavior and location of website visitors using Hadoop.
MapReduce is implemented to refine and sort the raw data.
Searching is done by country, IP address, postal code, and category.
Hadoop converts unstructured, structured, and semi-structured data into key-value pairs and then into a single value, which is represented in binary format.
The MapReduce framework is used for parallel implementation.

Big Data
Big Data is a term used to describe large collections of data, possibly unstructured, that grow so large and so quickly that they are difficult to manage with regular database or statistical tools.
The 3 Vs of Big Data: volume, velocity, and variety

Hadoop

Open source project started by Doug Cutting
A platform to manage Big Data
Helps in distributed computing
Runs on commodity hardware
Data storage (HDFS): runs on commodity hardware (usually Linux); horizontally scalable
Processing (MapReduce): parallelized (scalable) processing; fault tolerant

CORE PARTS OF HADOOP

Hadoop Distributed File System (HDFS)

HDFS is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers.
Some specific features ensure that Hadoop clusters are highly functional:
Rack awareness
Minimal data motion
Utilities
Rollback
Highly operable

How HDFS works

MapReduce
MapReduce is a programming model and an associated implementation for processing large data sets.
MapReduce usually splits the input data set into independent chunks, which are processed in a completely parallel manner.
This allows programmers without any experience of parallel and distributed systems to easily use the resources of a large distributed system.
The run-time system takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
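The split, map, and reduce stages described above can be sketched in plain Python. This is a single-process illustration, not Hadoop code: the sample records, the chunk size, and every function name are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical sample records (ip_address, country); not taken from the project's data.
RECORDS = [
    ("10.0.0.1", "IN"),
    ("10.0.0.2", "US"),
    ("10.0.0.3", "IN"),
    ("10.0.0.4", "DE"),
    ("10.0.0.5", "US"),
    ("10.0.0.6", "IN"),
]


def split_input(records, chunk_size):
    """Split the input into independent chunks, as the framework would."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map phase: emit a (country, 1) pair for every record in one chunk."""
    return [(country, 1) for _ip, country in chunk]


def shuffle(mapped_chunks):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """Reduce phase: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(records, chunk_size=2):
    chunks = split_input(records, chunk_size)
    # Each chunk is independent, so on a cluster these calls could run in parallel.
    mapped = [map_chunk(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in grouped.items())


counts = run_job(RECORDS)
print(counts)
```

Because no chunk depends on another, the map calls are the part a real cluster would distribute across nodes; the shuffle and reduce stages here run sequentially only to keep the sketch short.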

Execution flow in MapReduce

1. The MapReduce program that has been written tells the JobClient to run a MapReduce job.
2. The JobClient sends a message to the JobTracker, which produces a unique ID for the job.
3. The JobClient copies the job resources, such as the JAR file, to the distributed file system.
4. Once the resources are in the distributed file system, the JobClient tells the JobTracker to start the job.
5. The JobTracker does its own initialization for the job. It retrieves the input splits from the distributed file system.
6. Now that the JobTracker has work for the TaskTrackers, it returns a map task or a reduce task in response to a TaskTracker's heartbeat.
7. The TaskTrackers need to obtain the code to execute, so they get it from the shared file system.
8. The TaskTracker then runs its task.
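The eight steps above can be traced with a toy simulation. This is only a sketch under stated assumptions: the classes below are hypothetical stand-ins that borrow the component names, they are not the Hadoop API, and the shared file system is modelled as an ordinary dictionary.

```python
import itertools
from collections import deque


class JobTracker:
    """Toy stand-in for the JobTracker; not the Hadoop API."""

    def __init__(self, shared_fs):
        self.shared_fs = shared_fs
        self._ids = itertools.count(1)
        self.pending = deque()

    def new_job_id(self):
        # Step 2: produce a unique ID for the job.
        return f"job_{next(self._ids):04d}"

    def start_job(self, job_id):
        # Step 5: initialize the job by reading its input splits from the shared store.
        for index, split in enumerate(self.shared_fs[job_id]["splits"]):
            self.pending.append((job_id, f"map_{index}", split))

    def heartbeat(self, tracker_name):
        # Step 6: hand out one task in response to a heartbeat, if any work is left.
        return self.pending.popleft() if self.pending else None


class TaskTracker:
    """Toy stand-in for a TaskTracker."""

    def __init__(self, name, shared_fs):
        self.name = name
        self.shared_fs = shared_fs

    def poll(self, job_tracker):
        task = job_tracker.heartbeat(self.name)
        if task is None:
            return None
        job_id, task_id, split = task
        code = self.shared_fs[job_id]["code"]  # Step 7: fetch the code to execute.
        return task_id, code(split)            # Step 8: run the task.


def submit(job_tracker, shared_fs, code, splits):
    """Steps 1 to 4, as performed by a hypothetical job client."""
    job_id = job_tracker.new_job_id()
    shared_fs[job_id] = {"code": code, "splits": splits}  # Step 3: copy job resources.
    job_tracker.start_job(job_id)                         # Step 4: ask for the job to start.
    return job_id


shared_fs = {}
job_tracker = JobTracker(shared_fs)
job_id = submit(job_tracker, shared_fs, code=len, splits=[["a", "b"], ["c"]])
workers = [TaskTracker("tt1", shared_fs), TaskTracker("tt2", shared_fs)]
results = [worker.poll(job_tracker) for worker in workers]
print(job_id, results)
```

The point of the sketch is the pull model: the JobTracker never pushes work, it only answers heartbeats, and a TaskTracker fetches the code from shared storage rather than receiving it from the client.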

OTHER TECHNOLOGICAL TERMS
Clickstream Data
Clickstream data is the information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website log files.
Potential Uses of Clickstream Data
What is the most efficient path for a site visitor to research a product and then buy it?
What products do visitors tend to buy together, and what are they most likely to buy in the future?
Where should I spend resources on fixing or enhancing the user experience on my website?
This project focuses on the path optimization use case. Specifically: how can we improve our website to reduce bounce rates and improve conversion?
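As a rough illustration of the two metrics named above, the sketch below computes a bounce rate and a conversion rate from page views. The definitions are assumptions for this example (a bounce is a session with exactly one page view, a conversion is a session containing at least one purchase), and the sample data and field layout are hypothetical.

```python
from collections import Counter

# Hypothetical page views: (session_id, url, is_purchase).
PAGE_VIEWS = [
    ("s1", "/home", False),
    ("s2", "/home", False),
    ("s2", "/shoes", False),
    ("s2", "/checkout", True),
    ("s3", "/shoes", False),
    ("s4", "/home", False),
    ("s4", "/handbags", False),
]


def bounce_and_conversion(page_views):
    """Return (bounce_rate, conversion_rate) over all sessions."""
    views_per_session = Counter(session for session, _url, _buy in page_views)
    converted = {session for session, _url, buy in page_views if buy}
    total = len(views_per_session)
    # Assumed definition: a bounce is a session with exactly one page view.
    bounces = sum(1 for views in views_per_session.values() if views == 1)
    return bounces / total, len(converted) / total


bounce_rate, conversion_rate = bounce_and_conversion(PAGE_VIEWS)
print(bounce_rate, conversion_rate)
```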

STEP I

Upload the Acme website log dataset, which contains about 4 million rows of data representing five days of clickstream data.

STEP II

Represent the dataset in unstructured format, i.e. timestamp, registered user SWID, IP address, geocoded IP address, URL.
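A minimal sketch of turning one raw log line into the five fields listed above. The tab-separated layout, the field order, the sample line, and the names `LogRecord` and `parse_line` are all assumptions for illustration; the real Acme log layout is not given in these slides.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    timestamp: str
    swid: str   # registered user SWID
    ip: str
    geo: str    # geocoded IP address
    url: str


def parse_line(line: str) -> LogRecord:
    """Parse one tab-separated log line (assumed layout) into a LogRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return LogRecord(*fields)


# Hypothetical sample line, not taken from the dataset.
sample = "2012-03-15 01:34:46\t{SWID-0001}\t192.0.2.10\tseattle,wa,usa\thttp://www.example.com/shoes\n"
record = parse_line(sample)
print(record.ip, record.url)
```

Rejecting lines with the wrong field count, rather than padding them, keeps malformed rows out of the later counts.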

STEP III

Represent the users' data from the unstructured loaded dataset.

STEP IV

Represent the products by category from the dataset.

STEP V

Show the refined dataset of the Acme log files.

STEP VI

Combine all the tables, i.e. Acme log, products, and users.
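The combining step can be pictured as a left join of the log rows against the users and products tables. The sketch below uses in-memory dictionaries; the join keys (`swid` for users, `url` for products) and all sample values are assumptions, since the slides do not state how the tables are keyed.

```python
# Hypothetical tables; keys and values are made up for illustration.
LOGS = [
    {"swid": "u1", "ip": "192.0.2.1", "url": "/shoes/1"},
    {"swid": "u2", "ip": "192.0.2.2", "url": "/handbags/7"},
    {"swid": "u9", "ip": "192.0.2.9", "url": "/unknown"},
]
USERS = {
    "u1": {"gender": "F", "age": 28},
    "u2": {"gender": "M", "age": 35},
}
PRODUCTS = {
    "/shoes/1": "shoes",
    "/handbags/7": "handbags",
}


def combine(logs, users, products):
    """Left-join each log row with its user (by swid) and product category (by url)."""
    joined = []
    for row in logs:
        user = users.get(row["swid"], {})
        joined.append({
            **row,
            "gender": user.get("gender"),         # None when the user is unknown
            "age": user.get("age"),
            "category": products.get(row["url"]), # None when the URL is not a product page
        })
    return joined


combined = combine(LOGS, USERS, PRODUCTS)
print(combined[0])
```

A left join keeps every log row even when no user or product matches, so visits by unregistered users still appear in the counts.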

Results and Analysis

Configuration of Hadoop
Count of the number of visitors from any country
Retrieving the IP addresses and displaying the state of visitors
Number of IPs accessing a category at a time
Initial stage of mapping and reduction
Categories accessed, by total number of IPs
Shoes category by state, accessed by total number of IPs
Details of IPs accessed by visitors, by gender
Number of females who accessed this page
Total number of IP addresses that accessed a particular web page
Sum of the ages of all the visitors
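Two of the result types listed above, the number of IPs per category and the number of female visitors, can be sketched as follows. The rows are hypothetical, and counting distinct IP addresses (rather than raw hits) is an assumption about how the totals were meant to be read.

```python
from collections import defaultdict

# Hypothetical combined rows; not results from the project.
ROWS = [
    {"ip": "192.0.2.1", "category": "shoes", "gender": "F", "age": 28},
    {"ip": "192.0.2.1", "category": "shoes", "gender": "F", "age": 28},
    {"ip": "192.0.2.2", "category": "shoes", "gender": "M", "age": 35},
    {"ip": "192.0.2.3", "category": "handbags", "gender": "F", "age": 41},
]


def distinct_ips_per_category(rows):
    """Count how many distinct IP addresses accessed each category."""
    ips = defaultdict(set)
    for row in rows:
        ips[row["category"]].add(row["ip"])
    return {category: len(addresses) for category, addresses in ips.items()}


def female_visitors(rows):
    """Count distinct IP addresses whose visitor is recorded as female."""
    return len({row["ip"] for row in rows if row["gender"] == "F"})


ip_counts = distinct_ips_per_category(ROWS)
females = female_visitors(ROWS)
print(ip_counts, females)
```

Using a set per category means a visitor who loads the same category twice is counted once, which is why the repeated first row does not inflate the shoes total.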

Conclusion
The amount of clickstream data is growing rapidly, and with it the demand for accessing information over the web has increased significantly.
It is therefore important to analyze the behavior and location of visitors.
It is inefficient to process large data sets using traditional sequential methods.
MapReduce is therefore used for processing large data sets.

Future Scope
Clickstream information plays an important role in a wide variety of applications, such as decision support systems and profile-based marketing.
Location search is used by various industries, such as telecom and e-commerce, and in event detection.
The nearest-location method can be fused with any other method to better support decision making.
A trade-off would then be made between distance and the other fused factor.

Thank you !!!
