GitHub - Atkinssamuel - Applied-Map-Reduce

This document describes a GitHub repository containing code for applying MapReduce algorithms. It includes Java code for k-means clustering on a large dataset of 2D points and a line counting program on text data. The repository owner gained experience with MapReduce and Java through these applications. Advantages of using MapReduce for k-means clustering include speed through parallelization and scalability to large datasets. Disadvantages include added complexity compared to other implementations and constraints of the MapReduce programming model. The document also discusses using canopy selection to reduce the number of distance comparisons in k-means clustering.

Uploaded by

jefferyleclerc


3/10/24, 4:14 AM GitHub - atkinssamuel/applied-map-reduce

MapReduce Applied

https://github.com/atkinssamuel/applied-map-reduce

Project Description
The purpose of this project is to gain a thorough understanding of Hadoop
MapReduce through application. MapReduce is applied in two different contexts. The
first is a k-Means clustering algorithm on a large dataset of 2D points. The second is a
line-counting program that runs on a large set of text data. Through these
applications I was able to become proficient in MapReduce and Java.

k-Means Clustering

k-Means Clustering Implementation

k-Means clustering with MapReduce was implemented in the kmeans directory. k=4
and k=8 were used, and the results have been included in
kmeans/results/results.txt. The CLI was not used because of my familiarity with
IntelliJ. To run the MapReduce programs, I created a Java Console Application and
modified the run configuration with my command line arguments. I have included the
full output of running my k-Means MapReduce program in the
kmeans/results/console_output.txt file. The jar files for both the k-Means
algorithm and the line counter are present in the jars folder.

Advantages & Disadvantages of k-Means with MapReduce

The advantages and disadvantages of using MapReduce in the context of k-Means
clustering are as follows:

Advantages:

Speed
MapReduce allows computations to be executed in parallel, so computing
resources that would sit idle under a purely sequential implementation can
be put to work. In the context of k-Means, the distance from each point to
every centroid can be computed independently of the other points, which can
result in a substantial speed-up.
Scalability
As the dataset grows, the benefit of parallelization grows with it. MapReduce
scales much better than a sequential, loop-based calculation because the
work can be spread across all available computing power. Given that our
dataset is quite large, the scalability of MapReduce is apparent.
Control
MapReduce lets us decide explicitly how the work is divided between the map
and reduce phases. If we change our algorithm, we can adjust that division
to maintain reasonable performance. Suppose we wanted to modify our existing
algorithm to include canopy pre-clustering. We would be able to specify
exactly how our computing resources should be used for that extra stage.

Disadvantages:

Complexity
For some, using MapReduce to implement k-Means may not be intuitive. From
a code-comprehension standpoint, MapReduce adds a layer of complexity to
k-Means. If we were to implement this algorithm using scikit-learn, for
example, the clustering step would be essentially a one-line call.
Flexibility
MapReduce constrains us to thinking in terms of a Mapper and a Reducer.
Some applications are difficult to formulate in this way. Given the simplicity
of our algorithm, we were able to formulate a solution using a Mapper and a
Reducer. A more complex algorithm built on top of k-Means could prove
difficult to express as a Mapper and a Reducer.
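
To make the Mapper/Reducer formulation discussed above concrete, here is a minimal plain-Java sketch of one k-Means iteration. It deliberately avoids the Hadoop API, and the class and method names (KMeansSketch, nearestCentroid, mean, iterate) are hypothetical; it is an illustration under those assumptions, not the code in the kmeans directory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of one k-Means iteration phrased as a map step and a
// reduce step. No Hadoop classes are used; all names are hypothetical.
class KMeansSketch {

    // "Map" step: return the index of the centroid closest to a 2D point.
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            double dx = point[0] - centroids[i][0];
            double dy = point[1] - centroids[i][1];
            double dist = dx * dx + dy * dy; // squared Euclidean distance
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    // "Reduce" step: average all points that were assigned to one centroid.
    static double[] mean(List<double[]> points) {
        double sumX = 0, sumY = 0;
        for (double[] p : points) {
            sumX += p[0];
            sumY += p[1];
        }
        return new double[] { sumX / points.size(), sumY / points.size() };
    }

    // One iteration: group points by nearest centroid, then recompute centroids.
    static double[][] iterate(double[][] points, double[][] centroids) {
        Map<Integer, List<double[]>> groups = new HashMap<>();
        for (double[] p : points) {
            groups.computeIfAbsent(nearestCentroid(p, centroids),
                                   key -> new ArrayList<>()).add(p);
        }
        double[][] updated = new double[centroids.length][];
        for (int i = 0; i < centroids.length; i++) {
            List<double[]> assigned = groups.get(i);
            // A centroid with no assigned points is left where it was.
            updated[i] = (assigned == null) ? centroids[i].clone() : mean(assigned);
        }
        return updated;
    }
}
```

In a real MapReduce job the grouping done here by the HashMap would be performed by the framework's shuffle between the map and reduce phases, and a driver would call the job repeatedly until the centroids stop moving.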


Canopy Selection for k-Means Clustering

Reducing the Number of Distance Comparisons Using Canopy Selection
The canopy selection algorithm allows one to dramatically reduce the number of
comparisons used in a clustering algorithm. The algorithm groups the points into
canopies using two distance metrics (often described as two thresholds, a loose
one and a tight one, on a cheap distance measure). The loose distance metric is
used to add points to a given canopy and should be extremely fast to compute.
The tight distance metric is used to restrict points from being added to other
canopies and should be more accurate. After the points have been grouped into
canopies, they can be further clustered using a more accurate but more expensive
algorithm such as k-Means.
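
The procedure described above can be sketched sequentially as follows. This is a minimal illustration, not code from the repository: the class name CanopySketch, the parameter names loose and tight, and the choice of the Manhattan distance for both checks are assumptions. It also assumes that canopy membership is tested against every point in the dataset, so that a point can land in more than one canopy.

```java
import java.util.ArrayList;
import java.util.List;

// Sequential canopy selection sketch for 2D points. All names are hypothetical.
class CanopySketch {

    // Cheap distance used for both the loose and the tight check (an assumption).
    static double manhattan(double[] a, double[] b) {
        return Math.abs(a[0] - b[0]) + Math.abs(a[1] - b[1]);
    }

    // Returns, for each canopy, the indices of the points it contains.
    // Expects loose >= tight >= 0.
    static List<List<Integer>> canopies(double[][] points, double loose, double tight) {
        // Indices of points that may still become canopy centers.
        List<Integer> remaining = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            remaining.add(i);
        }

        List<List<Integer>> result = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Pick the next remaining point as a canopy center.
            double[] center = points[remaining.get(0)];

            // Loose check: every point within `loose` of the center joins the canopy.
            List<Integer> canopy = new ArrayList<>();
            for (int i = 0; i < points.length; i++) {
                if (manhattan(points[i], center) <= loose) {
                    canopy.add(i);
                }
            }
            result.add(canopy);

            // Tight check: points within `tight` of the center can no longer
            // start a canopy of their own. The center itself is always removed,
            // so the loop terminates.
            remaining.removeIf(idx -> manhattan(points[idx], center) <= tight);
        }
        return result;
    }
}
```

For the points (0,0), (1,0), (5,0), (6,0), (20,0) with loose = 5 and tight = 2, this sketch produces three canopies, and the first three points all fall into both of the first two canopies, which is the overlap that the later comparison-count argument relies on.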

Without canopy clustering, we must compare all n data points with all k cluster
centroids, resulting in kn comparisons per centroid update. With canopy clustering, we
only need to compare points and centroids that share a canopy. Let n be the number of
data points, c the number of canopies, and f the average number of canopies each point
falls into (a measure of how much the canopies overlap). Each canopy then contains
roughly fn/c points, and each cluster compares itself with the points in its own canopy
and in the canopies that overlap it. Over all k clusters, this results in roughly
nkf^2/c comparisons per iteration, which is c/f^2 times fewer than kn. For f close to 1,
that is a reduction by a factor of about c.
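
As a quick check of the arithmetic above, the two comparison counts can be evaluated for concrete numbers. The class name CanopyCost and the sample values in the usage note are hypothetical and chosen only for illustration; the formulas are the kn and nkf^2/c estimates from the text.

```java
// Evaluates the comparison-count estimates from the text. Names are hypothetical.
class CanopyCost {

    // Comparisons per iteration without canopies: every point against every centroid.
    static double plain(long n, long k) {
        return (double) n * k;
    }

    // Estimated comparisons per iteration with canopies: n * k * f^2 / c.
    static double withCanopies(long n, long k, double f, long c) {
        return n * k * f * f / c;
    }

    // Ratio of the two estimates: kn / (n * k * f^2 / c) = c / f^2.
    static double speedUp(double f, long c) {
        return c / (f * f);
    }
}
```

For example, with n = 1,000,000 points, k = 8 clusters, c = 100 canopies and f = 1, the estimate drops from 8,000,000 to 80,000 comparisons per iteration. With f = 2 the same setup only gains a factor of 25, which shows how quickly heavy overlap erodes the benefit.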

With respect to distance metrics in the context of k-Means clustering, we could use the
Manhattan distance as the loose distance metric. It needs only subtractions and absolute
values, so the comparison is fast, which is what we want. We could alternatively use the
Euclidean distance, which would be slower but more accurate. For the tight distance
metric, we desire more accuracy, so the Euclidean distance should be used.
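
The two metrics mentioned above can be written for 2D points as below. This is only a sketch with a hypothetical class name (Distances); it is not taken from the repository.

```java
// The two candidate distance metrics for 2D points. Names are hypothetical.
class Distances {

    // Manhattan (L1) distance: absolute differences only, no squares or roots.
    static double manhattan(double[] a, double[] b) {
        return Math.abs(a[0] - b[0]) + Math.abs(a[1] - b[1]);
    }

    // Euclidean (L2) distance: straight-line distance, needs a square root.
    static double euclidean(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }
}
```

One property worth keeping in mind when choosing thresholds: the Manhattan distance between two points is never smaller than their Euclidean distance, so a point that passes a Manhattan check against some threshold would also pass a Euclidean check against the same threshold.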

Canopy Selection on MapReduce

Using canopy selection with MapReduce offers further performance enhancement.

In the map phase, each mapper checks whether a data point lies within the distance
threshold of any canopy center it has already selected. If it does, the point is
discarded as a candidate. Otherwise, the point is added to the list of canopy center
candidates. The reducer receives the candidate centers from all mappers and removes
candidates that lie within the same threshold of one another (i.e. duplicate canopy
candidates for the same canopy center). The remaining centers and the distance
thresholds are then used to determine which points belong to which canopies. Remember
that points can belong to multiple canopies.
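
The map and reduce steps just described might look like the following in plain Java. The sketch does not use the Hadoop API, and the names (CanopyMapReduceSketch, selectCenters, map, reduce) are hypothetical. It assumes a single threshold and the Manhattan distance, and it models each mapper's input split as a list of 2D points.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of canopy center selection as map and reduce steps.
// No Hadoop classes are used; all names are hypothetical.
class CanopyMapReduceSketch {

    static double manhattan(double[] a, double[] b) {
        return Math.abs(a[0] - b[0]) + Math.abs(a[1] - b[1]);
    }

    // Shared logic: keep a point only if it is farther than `threshold`
    // from every center kept so far.
    static List<double[]> selectCenters(List<double[]> candidates, double threshold) {
        List<double[]> centers = new ArrayList<>();
        for (double[] p : candidates) {
            boolean covered = false;
            for (double[] c : centers) {
                if (manhattan(p, c) <= threshold) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                centers.add(p);
            }
        }
        return centers;
    }

    // Mapper: emits canopy center candidates for one input split.
    static List<double[]> map(List<double[]> split, double threshold) {
        return selectCenters(split, threshold);
    }

    // Reducer: merges the candidates from all mappers and drops duplicates,
    // i.e. candidates within `threshold` of a center that was already kept.
    static List<double[]> reduce(List<List<double[]>> mapperOutputs, double threshold) {
        List<double[]> all = new ArrayList<>();
        for (List<double[]> output : mapperOutputs) {
            all.addAll(output);
        }
        return selectCenters(all, threshold);
    }
}
```

Assigning points to the final canopies would be a further pass over the data, comparing each point against the centers returned by the reducer.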

Canopy Selection and k-Means Clustering

Canopy pre-clustering and k-Means can be used in conjunction with MapReduce to
substantially speed up the clustering process. First, the MapReduce canopy
pre-clustering method detailed above is applied. This results in a set of points, each
belonging to one or more canopies.

Then, the k-Means algorithm implemented in this project can be applied within the
defined canopies. Each cluster needs to compare the points in its own canopy with the
points in the overlapping canopies. Thus, the k-Means driver code present in the Main
class must be modified to accommodate this key difference. The k-Means algorithm
will then iterate until convergence. Given appropriate distance thresholds, each
iteration requires far fewer distance comparisons, so the algorithm should reach a
converged solution much faster than a vanilla k-Means implementation.
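
The key change to the assignment step can be sketched as follows: a point is only compared with centroids that share at least one canopy with it. This is a hypothetical illustration and not the repository's Main class. The names (CanopyKMeansSketch, nearestSharedCentroid) and the representation of canopy membership as sets of canopy indices are assumptions.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Canopy-restricted k-Means assignment for 2D points. All names are hypothetical.
class CanopyKMeansSketch {

    // Returns the index of the nearest centroid among those that share a
    // canopy with the point, or -1 if no centroid shares a canopy with it.
    static int nearestSharedCentroid(double[] point,
                                     Set<Integer> pointCanopies,
                                     double[][] centroids,
                                     List<Set<Integer>> centroidCanopies) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            // Skip the distance computation when no canopy is shared.
            if (Collections.disjoint(pointCanopies, centroidCanopies.get(i))) {
                continue;
            }
            double dx = point[0] - centroids[i][0];
            double dy = point[1] - centroids[i][1];
            double dist = dx * dx + dy * dy; // squared Euclidean distance
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }
}
```

How a point with no shared centroid (the -1 case) should be handled is a design decision this sketch leaves open; one option is to fall back to comparing that point against all centroids.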
