Word Count MapReduce with Hadoop Using Java on Windows 10

By Lokesh Madhav S

This project demonstrates a basic MapReduce program using the Apache Hadoop framework on a Windows computer. The Hadoop installation used here is pseudo-distributed (single node).

This tutorial covers running the MapReduce program on Windows. It assumes that a single-node Hadoop installation, the JDK, and Eclipse are already set up.

Java JDK version 1.8.0_291 and Hadoop version 3.3.0 are used here; the steps are similar for other versions.
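
Before continuing, both installations can be verified from a command prompt (a quick sanity check, assuming java and hadoop are on the PATH):

java -version
hadoop version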

First, open the Eclipse IDE and create a new Java project. Here the project is named map_reduce_example.

Now open Configure Build Path for the project and add the external Hadoop JAR files: the first from hadoop-3.3.0 -> share -> hadoop -> common (typically hadoop-common-3.3.0.jar), and the second from hadoop-3.3.0 -> share -> hadoop -> mapreduce (typically hadoop-mapreduce-client-core-3.3.0.jar).

After the JAR files are added successfully, Eclipse lists them under Referenced Libraries.

Next, add the source code for the word-count program.

Copy the code below into the WordCount.java class:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

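  // Mapper: splits each input line into tokens and emits (word, 1) for every token.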
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

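  // Reducer: sums the counts emitted for each word; it is also reused as the combiner.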
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
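    // args[0] is the HDFS input directory; args[1] is the HDFS output directory (which must not exist yet).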
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
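    // The reducer doubles as a combiner, since summing integers is associative and commutative.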
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
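
To see how the pieces fit together, trace a small example. Suppose, hypothetically, that an input file contains the single line "hello world hello". The mapper emits one (word, 1) pair per token, the framework groups those pairs by key during the shuffle, and the reducer sums each group:

map input:     hello world hello
map output:    (hello, 1), (world, 1), (hello, 1)
after shuffle: hello -> [1, 1], world -> [1]
reduce output: (hello, 2), (world, 1)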

After adding the source code, create a JAR file of map_reduce_example using the Export option in the Eclipse IDE (File -> Export -> Java -> JAR file).
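
Alternatively, the JDK's jar tool can build the JAR from the command line (a minimal sketch, assuming the compiled .class files sit in the project's bin folder; the cfe flags record WordCount as the main class so hadoop jar can run it without naming the class):

jar cfe WordCount.jar WordCount -C bin .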

  1. Open a Command Prompt as administrator.
  2. Move to Hadoop's sbin folder using the cd command (cd C:\hadoop-3.3.0\sbin).
  3. Start the daemons with the command start-all, or preferably run start-dfs followed by start-yarn to initialize HDFS and YARN separately.
  4. After starting, check that all daemons (namenode, datanode, resourcemanager, nodemanager) are running using the command (jps).
  5. Make an input directory in the Hadoop cluster using the command (hadoop fs -mkdir /input_directory).
  6. Now add the required text files to the input directory using the command (hadoop fs -put file_path/file_name.txt /input_directory). Here, test_wordcnt and test_wordcnt_2 are the input files, each containing words separated by spaces.
  7. Now run the MapReduce JAR file exported earlier using the command (hadoop jar jar_path/jar_name.jar /input_directory /output_directory); note the space separating the input directory from the output directory. The output directory must not already exist; Hadoop creates it and fails if it is present.
  8. Finally, check the output using the command (hadoop fs -cat /output_directory/*). With that, the word count MapReduce program has been executed successfully on Windows. The complete command sequence is collected below for reference.
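
For reference, here is the whole session as one command sequence (a sketch: test_wordcnt.txt, test_wordcnt_2.txt, C:\files, and C:\jars are example names and paths for this tutorial, so substitute your own):

cd C:\hadoop-3.3.0\sbin
start-dfs
start-yarn
jps
hadoop fs -mkdir /input_directory
hadoop fs -put C:\files\test_wordcnt.txt /input_directory
hadoop fs -put C:\files\test_wordcnt_2.txt /input_directory
hadoop jar C:\jars\WordCount.jar /input_directory /output_directory
hadoop fs -cat /output_directory/*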


Download mapreduce.zip attached here to get the WordCount.jar and input files used in this tutorial.
