Exercise 6 PDF

The document describes two Hadoop exercises: 1) a word count program that counts the frequency of words in texts from Shakespeare and the Bible, and 2) a k-mer counting program that counts overlapping k-length sequences in a genome file to find the most common k-mers. It provides instructions on running hadoop jar to perform word counting on sample input files and find the top 10 words, and on implementing k-mer counting to find the top 10 most frequent 9-mers in an E. coli genome file.

Uploaded by

việt lê

Exercise 6

1. Hadoop Word Count

The next program to test is the Hadoop word count program. This example reads text files and
counts how often each word occurs. Both the input and the output are text files; each line of the
output contains a word and the number of times it occurred, separated by a tab.
Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the
word and 1. Each reducer sums the counts for each word and emits a single key/value pair with the
word and its sum. As an optimization, the reducer is also used as a combiner on the map outputs.
This reduces the amount of data sent across the network, because repeated occurrences of a word in
one mapper's output are combined into a single record before the shuffle.
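The map, combine, and reduce steps described above can be sketched in plain Python. This is a local simulation for illustration only, not the code inside hadoop-examples.jar; the function names (`mapper`, `reducer`, `run_wordcount`) and the sample sentences are assumptions made for this sketch.

```python
from collections import defaultdict
from itertools import groupby


def mapper(line):
    """Emit a (word, 1) pair for every whitespace-separated word in the line."""
    for word in line.split():
        yield word, 1


def reducer(word, counts):
    """Sum the counts for one word; the same function serves as the combiner."""
    return word, sum(counts)


def run_wordcount(splits):
    """Simulate map -> combine -> shuffle -> reduce over a list of input splits."""
    combined = []
    for split in splits:
        # Map one split, then combine locally so each word leaves the
        # "mapper" as a single (word, partial_count) record.
        local = defaultdict(list)
        for line in split:
            for word, one in mapper(line):
                local[word].append(one)
        combined.extend(reducer(w, c) for w, c in local.items())
    # Shuffle: sort by key so equal words are adjacent, then reduce each group.
    combined.sort(key=lambda kv: kv[0])
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(combined, key=lambda kv: kv[0])
    )


# Two hypothetical input splits, one line each.
splits = [["to be or not to be"], ["to thine own self be true"]]
counts = run_wordcount(splits)
print(counts["to"], counts["be"])  # prints: 3 3
```

Without the combine step, the first split would send six records to the shuffle; with it, only four (one per distinct word) are sent.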
Before you can run the example, you'll have to copy some data into the distributed filesystem
(HDFS). Here we will create an input directory and copy in the complete works of Shakespeare
and the Bible (a standard large corpus for text mining).
The data file is also available for download as bible+shakes.nopunc.gz; make sure to gunzip it
after downloading.
$ hadoop fs -mkdir /user/USERNAME/wordcount
$ hadoop fs -mkdir /user/USERNAME/wordcount/input
$ hadoop fs -put /bluearc/data/schatz/data/textmining/bible+shakes.nopunc \
    /user/USERNAME/wordcount/input

To run the example, the command syntax is:

$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount \
    /user/USERNAME/wordcount/input \
    /user/USERNAME/wordcount/output

After this completes, download the results to your local directory like this:

$ hadoop fs -get /user/USERNAME/wordcount/output output

Question: What are the top 10 most frequently used words in the corpus?
Hint: Use the Unix commands sort and head to scan the output file.
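The hint points to sort and head; the same ranking can also be sketched in Python, which may be handy for checking a result. The sketch below assumes the tab-separated word/count format described earlier. The function name `top_words` and the sample lines are made up for illustration and are not real output from the corpus.

```python
import heapq


def top_words(lines, n=10):
    """Return the n (word, count) pairs with the highest counts."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # Each line is "word<TAB>count"; split on the last tab.
        word, count = line.rsplit("\t", 1)
        pairs.append((word, int(count)))
    return heapq.nlargest(n, pairs, key=lambda wc: wc[1])


# Hypothetical sample in the same format as the word count output.
sample = ["the\t5\n", "and\t7\n", "of\t3\n", "thou\t1\n"]
print(top_words(sample, n=2))  # prints: [('and', 7), ('the', 5)]
```

To apply this to the downloaded results, pass an open file object from the output directory instead of the sample list. The names of the files inside that directory depend on the Hadoop version, so check the directory listing first.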

2. Hadoop Kmer Counting

The next exercise is to implement a kmer counter using Hadoop. Conceptually this is very
similar to the wordcount program, but since there are no spaces in a genome sequence, we will
count overlapping kmers instead of discrete words.
The idea is that if the genome is:

>chr1
ACACACAGT

and we are counting 3-mers, your map function will output:

ACA 1
CAC 1
ACA 1
CAC 1
ACA 1
CAG 1
AGT 1

The shuffle phase will then group the pairs so that identical keys come right after each other:

ACA 1
ACA 1
ACA 1
CAC 1
CAC 1
CAG 1
AGT 1

And your reducer will output:

ACA 3
CAC 2
CAG 1
AGT 1
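The worked example above can be reproduced with a few lines of Python. This is only a local illustration of the map and reduce logic, not a Hadoop job; the helper name `kmers` is an assumption made for this sketch.

```python
from collections import Counter


def kmers(sequence, k):
    """Yield every overlapping substring of length k, left to right."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


# "Map": one (kmer, 1) pair per position in the example genome.
map_output = [(kmer, 1) for kmer in kmers("ACACACAGT", 3)]

# "Reduce": sum the 1s for each distinct kmer.
counts = Counter()
for kmer, one in map_output:
    counts[kmer] += one

print([kmer for kmer, _ in map_output])
# prints: ['ACA', 'CAC', 'ACA', 'CAC', 'ACA', 'CAG', 'AGT']
print(dict(counts))
# prints: {'ACA': 3, 'CAC': 2, 'CAG': 1, 'AGT': 1}
```

A sequence of length n contains n - k + 1 overlapping kmers, which is why the nine-base example yields seven 3-mers.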

You can implement this in Java, using the WordCount program as an example, or you can use
Hadoop Streaming to implement it in any language you would like.
The Hadoop Streaming documentation describes how to use it:
https://hadoop.apache.org/docs/r1.2.1/streaming.html
And here is a nice tutorial using Python:
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
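As a starting point, a Streaming-style mapper and reducer for kmer counting might look like the sketch below. It is not a reference solution: the names `map_lines` and `reduce_lines`, the `map`/`reduce` command-line argument, and the choice to skip FASTA header lines are all assumptions made for illustration. It also treats each input line independently, so kmers that span a line break in a multi-line FASTA file are not counted; a complete solution has to deal with that.

```python
import sys
from itertools import groupby

K = 9  # kmer length asked for in the exercise question


def map_lines(lines, k=K):
    """Mapper: emit 'kmer<TAB>1' for each kmer found within one sequence line."""
    for line in lines:
        seq = line.strip()
        if not seq or seq.startswith(">"):
            continue  # skip blank lines and FASTA header lines such as '>chr1'
        for i in range(len(seq) - k + 1):
            yield f"{seq[i:i + k]}\t1"


def reduce_lines(lines):
    """Reducer: sum counts over consecutive lines that share a key.

    Relies on the input arriving sorted by key, as Hadoop Streaming delivers it.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for kmer, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{kmer}\t{sum(int(count) for _, count in group)}"


# Hypothetical entry point: run as "script map" or "script reduce",
# reading from stdin and writing to stdout as Streaming expects.
if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

The logic can be tested locally before it is submitted to the cluster, for example by calling `list(reduce_lines(sorted(map_lines([">chr1", "ACACACAGT"], k=3))))`, which reproduces the 3-mer counts from the example above.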
The genome file is available here: ecoli.fa.gz
Question: What are the top 10 most frequently occurring 9-mers in E. coli?

