Hadoop Map-Reduce quick tutorial

ML-for-NLP discussion on Hadoop.

When to use Hadoop? If you have a large input file that you want to split and run each split as an input to a program (i.e. the mapper), and later process (i.e. the reducer) the output of each mapper, then Hadoop is for you.

Hadoop files are stored on Distributed File System. Each file is split based on lines. Several lines make a split, also called block. These blocks are replicated on multiple systems. See the image below. Multiple mappers/reducers may run on the same block but on different nodes. So even if a node in the cluster fails, your program does not terminate.

Image Source

The mapper takes each block as input and performs some computation. The output of your mapper should be in the format - (key, value) pairs. By default, each line in your mapper's output is treated as the key. In the shuffling/sorting stage, all the output lines which have same key are sent to the same node. The node sorts this data based on the keys. This sorted output becomes an input to the reducer. The reducer takes (key, value) pairs and performs further computation giving the final output. You will have as many output files as the number of reducers.

Here is a quick visualisation:

Image Source

In Edinburgh, we have a Hadoop cluster accessible at namenode.inf.ed.ac.uk. Just ssh into it and try the following simple exercise:

Here I demonstrate Hadoop with Word Count program:

Input File: input.txt: An article from BBC.
Mapper Program word_splitter.py to split a line into words.
Reducer Program count_words.py to count words from a sorted input. Ben suggested even shorter program.

Copying input.txt to Hadoop file system

Command: hadoop dfs -copyFromLocal input.txt /user/s1051585/data/ml-for-nlp/

Run Hadoop with default Mapper (unix command: cat) and Reducer (unix command: cat)

Command: hadoop jar hadoop-0.20.2-streaming.jar \
-input /user/s1051585/data/ml-for-nlp/input.txt \
-output /user/s1051585/data/ml-for-nlp/mapCat_redCat \
-mapper cat \
-reducer cat

Output: mapCat_redCat

Run Hadoop with Mapper word_splitter.py and default Reducer

Command: hadoop jar hadoop-0.20.2-streaming.jar \
-input /user/s1051585/data/ml-for-nlp/input.txt \
-output /user/s1051585/data/ml-for-nlp/wordCount_without_reducer \
-mapper word_splitter.py \
-file word_splitter.py \
-jobconf mapred.job.name="wordCount_without_reducer"

Output: wordCount_without_reducer

Run Hadoop with Mapper word_splitter.py and Reducer count_words.py

Command: hadoop jar hadoop-0.20.2-streaming.jar \
-input /user/s1051585/data/ml-for-nlp/input.txt \
-output /user/s1051585/data/ml-for-nlp/wordCount_with_reducer \
-mapper word_splitter.py \
-file word_splitter.py \
-reducer count_words.py \
-file count_words.py \
-jobconf mapred.job.name="wordCount_with_reducer" \
-jobconf mapred.map.tasks=10

Output: wordCount_with_reducer

Simple Exercises with explanation

Exercise 1: Hadoop commands
Exercise 2: Your first Hadoop program

Acknowledgements: Miles Osborne for organizing Hadoop Hackathon and Bharat Ram Ambati for being my Hadoop Hackathon partner.

Site Counter