Exercise 6 PDF
/user/USERNAME/wordcount/input \
/user/USERNAME/wordcount/output
After this completes, download the results to your local directory like this:
$ hadoop fs -get /user/USERNAME/wordcount/output output
Question: What are the top 10 most frequently used words in the corpus?
Hint: Use the Unix commands sort and head to scan the output file.
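If you would rather do the ranking in Python, here is a minimal sketch. It assumes each output line holds a word and its count separated by a tab, and that the downloaded files sit under output/ with names starting with part-. Both are assumptions about your run, and top_words and read_output are hypothetical helper names.

```python
import glob
import heapq


def top_words(lines, n=10):
    """Return the n (word, count) pairs with the largest counts."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: word and count separated by whitespace (a tab).
        word, count = line.rsplit(None, 1)
        pairs.append((word, int(count)))
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


def read_output(directory="output"):
    """Yield every line from the part-* files in the downloaded directory."""
    for path in sorted(glob.glob(directory + "/part-*")):
        with open(path) as handle:
            for line in handle:
                yield line


if __name__ == "__main__":
    for word, count in top_words(read_output()):
        print(word, count)
```

Using heapq.nlargest avoids sorting the whole list when only the top few entries are needed.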
For example, if the sequence is
ACACACAGT
and we are counting 3-mers, your map function will output
ACA 1
CAC 1
ACA 1
CAC 1
ACA 1
CAG 1
AGT 1
The shuffle phase will then sort the pairs so that records with the same key are next to each other
ACA 1
ACA 1
ACA 1
CAC 1
CAC 1
CAG 1
AGT 1
Your reduce function will then add up the counts for each key and output
ACA 3
CAC 2
CAG 1
AGT 1
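The map, shuffle, and reduce steps of this example can be simulated locally before running anything on Hadoop. The sketch below is only an illustration of the data flow, not a Hadoop program; map_kmers, shuffle, and reduce_counts are hypothetical names chosen for this example.

```python
from itertools import groupby


def map_kmers(sequence, k):
    """Map step: emit (k-mer, 1) for every window of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], 1


def shuffle(pairs):
    """Shuffle step: sort by key so equal keys are adjacent."""
    return sorted(pairs, key=lambda pair: pair[0])


def reduce_counts(sorted_pairs):
    """Reduce step: sum the values of each run of equal keys."""
    for kmer, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield kmer, sum(count for _, count in group)


if __name__ == "__main__":
    for kmer, total in reduce_counts(shuffle(map_kmers("ACACACAGT", 3))):
        print(kmer, total)
```

Because this sketch sorts the keys alphabetically, AGT comes out before CAC. The counts per key do not depend on that order.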
You can implement this in Java, using the WordCount program as an example, or you can use
Hadoop Streaming to implement it in any language you would like.
The Hadoop Streaming documentation describes how to use it:
https://hadoop.apache.org/docs/r1.2.1/streaming.html
And here is a nice tutorial using Python:
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
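As a starting point for the Streaming route, here is a minimal Python sketch of a mapper and reducer that read lines and write tab-separated records, which is the format Hadoop Streaming uses by default. Several things here are assumptions rather than part of the exercise: the single-script layout with a map or reduce argument, the names mapper and reducer, and the decision to skip FASTA header lines. Note also that this mapper treats each input line independently, so k-mers that span a line break in the genome file are not counted; handling those is left open.

```python
import sys

K = 9  # k-mer length; change as needed


def mapper(lines, k=K):
    """Emit 'kmer<TAB>1' for every k-mer on each sequence line."""
    for line in lines:
        line = line.strip().upper()
        # Skip blank lines and FASTA header lines, which start with '>'.
        if not line or line.startswith(">"):
            continue
        for i in range(len(line) - k + 1):
            yield "%s\t1" % line[i:i + k]


def reducer(lines):
    """Sum counts per key; relies on the input being sorted by key."""
    current, total = None, 0
    for line in lines:
        kmer, count = line.rstrip("\n").split("\t")
        if kmer != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = kmer, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)


if __name__ == "__main__":
    # Hypothetical convention: pass 'map' or 'reduce' as the first argument.
    if len(sys.argv) > 1:
        step = mapper if sys.argv[1] == "map" else reducer
        for record in step(sys.stdin):
            print(record)
```

The reducer only compares each key with the previous one, so it works in constant memory but depends on the sorted order that the shuffle phase provides.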
The genome file is available here: ecoli.fa.gz
Question: What are the top 10 most frequently occurring 9-mers in E. coli?