GitHub - Atkinssamuel - Applied-Map-Reduce
https://github.com/atkinssamuel/applied-map-reduce 1/6
3/10/24, 4:14 AM GitHub - atkinssamuel/applied-map-reduce
Project Description
The purpose of this project is to gain a thorough understanding of Hadoop
MapReduce through application. MapReduce is applied in two different contexts. The
first is a k-Means clustering algorithm run on a large dataset of 2D points. The second
is a line-counting program that runs on a large body of text data. Through these
applications I became proficient in MapReduce and Java.
k-Means Clustering
Advantages:
Speed
MapReduce allows computations to be executed in parallel, so computing
resources that would sit idle under sequential execution can be put to use.
In the context of k-Means, the distance from each point to every centroid
can be computed independently, and therefore in parallel, which results in
a substantial speedup.
Scalability
As the dataset grows, the demand for parallelization grows. MapReduce
scales much better than standard loop-based calculations because of the
ability to harness all available computing power. Given that our dataset is
quite large, the scalability of MapReduce is apparent.
Control
MapReduce allows us to control exactly how we harness the computing
power. If we decide to change our algorithm, we can optimize our computing
power accordingly to maintain reasonable performance. Suppose we wanted
to modify our existing algorithm to include canopy pre-clustering. We would
easily be able to specify exactly how we would like our computing resources
to be utilized.
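To illustrate the Speed point, one k-Means iteration splits naturally into a map step (assign each point to its nearest centroid) and a reduce step (average the points assigned to each centroid). The following is a minimal sketch in plain Java, not the repository's Hadoop code: the class name `KMeansStepSketch`, its method names, and the sample points are all hypothetical, and a parallel stream stands in for the cluster's parallel mappers.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: plain Java standing in for a Hadoop Mapper/Reducer pair.
final class KMeansStepSketch {

    // "Map" side: index of the centroid closest to a 2D point.
    // Squared Euclidean distance is enough for picking the minimum.
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            double dx = point[0] - centroids[i][0];
            double dy = point[1] - centroids[i][1];
            double dist = dx * dx + dy * dy;
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    // "Reduce" side: mean of all points assigned to one centroid.
    static double[] mean(List<double[]> points) {
        double sumX = 0, sumY = 0;
        for (double[] p : points) {
            sumX += p[0];
            sumY += p[1];
        }
        return new double[] { sumX / points.size(), sumY / points.size() };
    }

    // One centroid update: assignments run in parallel, then each group is averaged.
    static double[][] iterate(List<double[]> points, double[][] centroids) {
        Map<Integer, List<double[]>> groups = points.parallelStream()
            .collect(Collectors.groupingBy(p -> nearestCentroid(p, centroids)));
        double[][] updated = new double[centroids.length][];
        for (int i = 0; i < centroids.length; i++) {
            List<double[]> assigned = groups.get(i);
            // A centroid with no assigned points keeps its previous position.
            updated[i] = (assigned == null) ? centroids[i].clone() : mean(assigned);
        }
        return updated;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
            new double[] {0, 0}, new double[] {0, 2},
            new double[] {10, 10}, new double[] {10, 12});
        double[][] centroids = { {1, 1}, {9, 9} };
        // prints [[0.0, 1.0], [10.0, 11.0]]
        System.out.println(Arrays.deepToString(iterate(points, centroids)));
    }
}
```

Because each call to `nearestCentroid` depends only on one point and the current centroids, the assignments can be distributed freely; only the averaging needs the grouped results.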
Disadvantages:
Complexity
For some, using MapReduce to implement k-Means may not be intuitive. In
terms of code readability, MapReduce adds a layer of complexity to
k-Means. If we were to implement this algorithm using scikit-learn, for
example, it would be an intuitive one-line solution.
Flexibility
MapReduce constrains us to thinking in terms of a Mapper and a Reducer.
Some applications are difficult to formulate in this way. Given the simplicity
of our algorithm, we were able to formulate a solution using a Mapper and a
Reducer. A more complex algorithm that might build on top of k-Means
could potentially prove difficult to formulate using a Mapper and a Reducer.
Without canopy clustering, we must compare all n data points with all k centroids,
resulting in kn comparisons per centroid update. With canopy clustering, we only
need to compare points that share a canopy. Each canopy contains roughly fn/c points,
where f is the overlap factor (the average number of canopies a point falls into), n is
the number of data points, and c is the number of canopies. Each cluster needs to
compare itself against the fn/c points in each of the f canopies it overlaps, or nf^2/c
points in total. For all k clusters, this results in nkf^2/c comparisons per update which,
for f close to 1, reduces the comparison count by roughly a factor of c.
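The two comparison counts can be checked with a short calculation. The sketch below uses made-up values for n, k, c, and f (they are not measurements from this project), and the class name `CanopyCostSketch` is hypothetical.

```java
// Hypothetical sketch: compares the two per-update comparison counts.
final class CanopyCostSketch {

    // Plain k-Means: every point is compared with every centroid.
    static double plainComparisons(long n, long k) {
        return (double) n * k;
    }

    // Canopy estimate from the text: n * k * f^2 / c.
    static double canopyComparisons(long n, long k, double f, long c) {
        return (double) n * k * f * f / c;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        long n = 1_000_000, k = 10, c = 100;
        double f = 1.5;
        double plain = plainComparisons(n, k);
        double canopy = canopyComparisons(n, k, f, c);
        // prints plain=10000000 canopy=225000 ratio=44.4
        System.out.printf("plain=%.0f canopy=%.0f ratio=%.1f%n", plain, canopy, plain / canopy);
    }
}
```

The ratio is c/f^2, so the saving approaches a factor of c only as the overlap factor f approaches 1.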
With respect to distance metrics in the context of k-Means clustering, we could use the
Manhattan distance as the metric for the loose distance threshold, since it is cheap to
compute and a rapid comparison is desired. We could alternatively use the Euclidean
distance, which would be slower but more accurate. For the tight distance threshold, we
desire more accuracy, so the Euclidean distance should be used.
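For 2D points the two metrics look as follows. This is a small sketch with hypothetical names (`DistanceSketch`, `manhattan`, `euclidean`), not code taken from the repository.

```java
// Hypothetical sketch of the two distance metrics for 2D points.
final class DistanceSketch {

    // Manhattan (L1) distance: two subtractions, two absolute values, one addition.
    static double manhattan(double[] a, double[] b) {
        return Math.abs(a[0] - b[0]) + Math.abs(a[1] - b[1]);
    }

    // Euclidean (L2) distance: additionally needs squaring and a square root.
    static double euclidean(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    public static void main(String[] args) {
        double[] p = {0, 0};
        double[] q = {3, 4};
        System.out.println(manhattan(p, q)); // prints 7.0
        System.out.println(euclidean(p, q)); // prints 5.0
    }
}
```

The Manhattan distance is never smaller than the Euclidean distance between the same two points, so a threshold value does not mean the same thing under both metrics and each threshold has to be chosen for the metric it is paired with.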
Then, the k-Means algorithm implemented in this project can be applied to each of the
defined canopies. Each cluster needs to compare the points in its own canopy with the
points in the overlapping canopies. Thus, the k-Means driver code in the Main
class must be modified to accommodate this key difference. The k-Means algorithm
will then iterate until convergence and, given appropriate distance thresholds, will
converge to a solution much faster than a vanilla k-Means implementation.
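The canopy-forming step that would run before k-Means can be sketched as below. This follows the usual canopy procedure (pick a remaining point as a center, gather everything within the loose threshold into a canopy, then drop everything within the tight threshold from the candidate list) combined with the metric choice discussed above. It is an assumption about how the step could look, not code from this repository: the class name `CanopySketch`, the threshold values, and the sample points are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of canopy formation over 2D points.
final class CanopySketch {

    static double manhattan(double[] a, double[] b) {
        return Math.abs(a[0] - b[0]) + Math.abs(a[1] - b[1]);
    }

    static double euclidean(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    // Returns canopies as lists of point indices; a point may appear in several canopies.
    static List<List<Integer>> buildCanopies(double[][] points, double loose, double tight) {
        List<Integer> remaining = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            remaining.add(i);
        }
        List<List<Integer>> canopies = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // The first remaining point becomes the next canopy center.
            double[] center = points[remaining.get(0)];
            List<Integer> canopy = new ArrayList<>();
            for (int i = 0; i < points.length; i++) {
                // Loose threshold, cheap metric: membership in this canopy.
                if (manhattan(points[i], center) <= loose) {
                    canopy.add(i);
                }
            }
            canopies.add(canopy);
            // Tight threshold, accurate metric: these points can no longer become centers.
            // The center itself is at distance 0, so the list shrinks on every pass.
            remaining.removeIf(i -> euclidean(points[i], center) <= tight);
        }
        return canopies;
    }

    public static void main(String[] args) {
        double[][] points = { {0, 0}, {1, 0}, {4, 0}, {5, 0}, {10, 0} };
        // prints [[0, 1], [1, 2, 3], [4]]
        System.out.println(buildCanopies(points, 3.0, 1.5));
    }
}
```

In the sample output, point 1 belongs to two canopies, which is the kind of overlap that the modified driver would have to account for when restricting comparisons.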