Apr 9, 2011

Controlling Hadoop MapReduce Job recursion

This post is related to the previous post.

Sometimes you come across problems that need to be solved in a recursive manner, for example the graph exploration algorithm in my previous post.
You have to chain the jobs and let each job work on the output of the previous one. And of course you need a break condition. This can either be a fixed limit on how many recursions to run, or a condition based on how much work a recursion actually did.
Let me focus on the second break condition, using my graph exploration example.

Counter
First off, you should know that Hadoop has counters; you can see them after a job has run, or in the web interface of the JobTracker. "Famous" counters are "Map input records" and "Reduce output records".
Best of all, we can set up our own counters, simply by using enums.

How to set up a counter?
The simplest approach is to just define an enum like this:

public enum UpdateCounter {
  UPDATED
}

Now you can manipulate the counter using:

context.getCounter(UpdateCounter.UPDATED).increment(1);

"context" is the context object you get from your mapper or your reducer.
This line will obviously increment your update counter by 1.
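
To make the usage a bit more concrete, here is a minimal sketch of how such an increment could look inside the reducer of the graph example. VertexWritable is the value class from the previous post; the update check below is only a placeholder, not the real exploration logic:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class ExplorationReducer extends
    Reducer<LongWritable, VertexWritable, LongWritable, VertexWritable> {

  public enum UpdateCounter {
    UPDATED
  }

  @Override
  protected void reduce(LongWritable key, Iterable<VertexWritable> values,
      Context context) throws IOException, InterruptedException {
    // merge the incoming vertex messages here (left out)
    boolean vertexChanged = true; // placeholder: did the merge update the vertex?
    if (vertexChanged) {
      // one more vertex was updated, so another recursion will be needed
      context.getCounter(UpdateCounter.UPDATED).increment(1);
    }
    // write the merged vertex back out, e.g. context.write(key, mergedVertex);
  }
}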

How to fetch the counter?

This is as easy as setting up the enum. You submit a job like this:
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJobName("Graph explorer");

job.setMapperClass(DatasetImporter.class);
job.setReducerClass(ExplorationReducer.class);
// leave out the stuff with paths etc.
job.waitForCompletion(true);

Be sure that the job has finished before you read the counter; using waitForCompletion(true) is recommended. Querying the counter while the job is still running can end in strange results ;)
You can access your counter like this:
long counter = job.getCounters().findCounter(ExplorationReducer.UpdateCounter.UPDATED)
    .getValue();

How to get the recursion running?

Now that we know how to get the counter, setting up the recursion is quite simple. The only thing you should watch out for is output paths that already exist from older job runs.
Look at this snippet:
// variable to keep track of the recursion depth
int depth = 0;
// counter from the previous import job that was submitted above
long counter = job.getCounters().findCounter(ExplorationReducer.UpdateCounter.UPDATED)
    .getValue();
// filesystem handle and working paths; declared here since the earlier path setup is left out
FileSystem fs = FileSystem.get(conf);
Path in;
Path out;

depth++;
while (counter > 0) {
  // reuse the conf reference with a fresh object
  conf = new Configuration();
  // set the depth into the configuration
  conf.set("recursion.depth", depth + "");
  job = new Job(conf);
  job.setJobName("Graph explorer " + depth);

  job.setMapperClass(ExplorationMapper.class);
  job.setReducerClass(ExplorationReducer.class);
  job.setJarByClass(ExplorationMapper.class);
  // always work on the output path of the previous depth
  in = new Path("files/graph-exploration/depth_" + (depth - 1) + "/");
  out = new Path("files/graph-exploration/depth_" + depth);

  SequenceFileInputFormat.addInputPath(job, in);
  // delete the output path if it already exists
  if (fs.exists(out))
    fs.delete(out, true);

  SequenceFileOutputFormat.setOutputPath(job, out);
  job.setInputFormatClass(SequenceFileInputFormat.class);
  job.setOutputFormatClass(SequenceFileOutputFormat.class);
  job.setOutputKeyClass(LongWritable.class);
  job.setOutputValueClass(VertexWritable.class);
  // wait for completion and update the counter
  job.waitForCompletion(true);
  depth++;
  counter = job.getCounters().findCounter(ExplorationReducer.UpdateCounter.UPDATED)
      .getValue();
}


Note that if you never incremented your counter, it will always be 0. It could also be null if you never used the enum in your mapper or reducer at all.
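
Following that note, a defensive read could look like this (just a sketch; in most Hadoop versions findCounter simply hands back a counter with value 0 in that case):

// defensive counter read: treat a missing counter as 0
Counter updated = job.getCounters().findCounter(
    ExplorationReducer.UpdateCounter.UPDATED);
long counter = (updated == null) ? 0 : updated.getValue();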

The full source code can always be found here:
http://code.google.com/p/hama-shortest-paths/

27 comments:

  1. Your blog posts are very helpful for me.
    One question. What if we wanted to use TextInputFormat?
    KeyValueTextInputFormat is the one that is designed for "iterative" mapreduce jobs. But, unfortunately it is no longer supported in the new API.

  2. Hi,

    you can use the TextInputFormat as well, but then you have to make sure that you output text, too. So you have to use the TextOutputFormat, which must produce exactly the same formatting as your previous input.

    If you can't achieve that formatting, you are better off adding a preprocessing job and sticking with SequenceFiles.

  3. Hi,

    Your job is so helpful.
    Is there a way to cache the output of one job for the next job, i.e. to skip writing it to disk, or at least to skip reading the input from disk?

    Regards,

  4. Hi,

    if you are in distributed mode, the distribution of the splits is "completely" random, so Hadoop itself won't benefit from caching.

    If you are searching for a full cached solution, you should take a look into Spark.

    http://www.spark-project.org

    Or you can take a look at Apache Hama; there you can use caching very well, and the iterations are much faster than in Hadoop MapReduce.

  5. This comment has been removed by a blog administrator.

  6. Where is the counter incremented in the code and when do we get out of the loop?

  7. The counter is incremented in the MapReduce job that is run inside the while loop, which has the break condition (counter > 0).

  8. Hi, thanks for this tutorial, it's really helpful.
    But what if some reducer tasks increment the counter but some do not?

  9. Hey,

    it checks the counters after the job has run, so it takes the sum over all reducers of a single job.
    If you have jobs where the reducers never increment the counter at all, well, then this won't work and you have to find another metric.

  10. You mean that if some reducers increment the counter but some do not, this mechanism for recursion is not suitable?

  11. Oh! I got the idea of your example!
    All I have to do is to make sure that the non-converged reducer tasks will increment the counter, and check if counter is still > 0.
    Thanks very much again!

  12. hi.. Is there any output type in Hadoop which can output a graph structure? or how are graphs in general implemented in hadoop?

  13. No, there is no built-in graph structure in Hadoop.
    Graphs are somewhat abstract, so you can actually express one as an adjacency list by using a key and a list of keys as the value.

    In Hadoop generics this would look something like:

    < Text, ArrayWritable >

    where the ArrayWritable consists of Text keys. This is then your adjacency list. Now you can run fancy graph algorithms on it ;) (A small sketch of such a value class can be found after the comments.)

  14. Hello, I want to ask you a question.
    I know that in the reduce() function, if a vertex is updated (and set to active) the counter is incremented, and if a vertex is not updated (and set to inactive) the counter is not incremented.
    In the main() function you use while (counter > 0) as the loop condition.
    My question is: in the first iterations the counter is incremented in reduce(), so the while loop keeps running. When in some iteration no vertex is updated, the counter is not incremented and the loop should stop, but I never see the counter being decremented, so while (counter > 0) still looks true to me and the loop would continue.
    How does the counter decrease, so that the loop condition can ever become false?

  15. Hi, the counter never decreases, but every MapReduce job starts this counter at 0 again. So if the reducer does not increment it, it will return 0.

  16. Hi, I want to ask a question not directly related to this article. I am new to Hadoop and wanted to ask how I can submit Hadoop jobs from a machine that is not part of the Hadoop cluster (not a namenode or a datanode). Is just including the Hadoop libraries/jars and setting the Configuration object with the HDFS and JobTracker URLs enough to do the job?

  17. Hi, exactly like you said. Or, if you are more the XML kind of guy, you can copy the hdfs-site.xml/core-site.xml of your cluster that contain this information to the other machine and use conf.addResource(...). The jars should reside on both sides.

    Replies
    1. I couldn't get what you mean by "The jars should reside on both sides." For example, if I am running a program that calls Mahout code on my machine and I want it to start a job on an external Hadoop cluster (I don't have Hadoop installed on my machine), do you mean I have to copy the jar file of my program to the Hadoop namenode, or what do you mean by both sides?
      Thanks a lot & best regards

    2. The Mahout jars must reside on your classpath and the jars that are in Hadoop's lib folder should be there too.

    3. Thanks Thomas, that's what I did, and it is running locally with the configuration set to localhost and the ports of the Hadoop installation on my machine. But I was afraid that with the actual setup this won't work by just replacing the URL with the URL of the namenode?

    4. That will work yes, then it writes/reads from the HDFS you configured.

  18. Hi

    Thanks for the info you have shared. I have couple of quick questions:

    1. Is the enum declared in the Mapper/Reducer class, since it's their count which we need to monitor?
    2. Is context.getCounter(...).increment(...) called again in the mapper or the reducer?
    3. I am implementing this in a reducer, so even if one reducer instance runs, it will increment and exit the loop?

    Thanks
    M

  19. Hi M,

    to 1: The enum can be declared anywhere, but it must be accessible from the Controller class that submits the job, as well as the Mapper/Reducer class.

    to 2: Exactly, the counters are incremented in the mapper and/or in the reducer

    to 3: yes.

  20. This comment has been removed by the author.

  21. Hi, I found this blog really very helpful.

    I am actually dealing with a Twitter dataset, where I have information on how many times a particular user retweeted another user's tweets.
    So I have rows saying:
    1 -> 2 (20 times)
    2 -> 1 (5 times)
    I am writing a MapReduce job to process this data and I am trying to aggregate this information into something like:
    1 -- 2 (25 times and the relationship is mutual)
    Can this be done using MapReduce? The dataset is quite large and I am having a difficult time figuring this out.
    Thanks.

  22. Yes, it is definitely possible in MapReduce. Write your own key class (a WritableComparable) which treats 1,2 and 2,1 as the same key. In the reducer you will then have (1,2) or (2,1) as the key and their counts as values. You can sum the values for the total number of tweets, and to identify whether the relationship is mutual you just need to check whether the number of values (not their sum) is more than one. (A rough sketch of such a key class follows after the comments.)

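
To make the adjacency-list idea from comment 13 a bit more concrete, here is a minimal sketch of a value type built on ArrayWritable. The class name TextArrayWritable is only illustrative and not part of the post's source code; ArrayWritable needs such a subclass with a no-argument constructor so that Hadoop can instantiate it during deserialization.

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;

// Adjacency list value: the Text keys of a vertex's neighbours.
public class TextArrayWritable extends ArrayWritable {

  // no-arg constructor so Hadoop can create instances while reading
  public TextArrayWritable() {
    super(Text.class);
  }

  public TextArrayWritable(Text[] neighbours) {
    super(Text.class, neighbours);
  }
}

A record of < Text, TextArrayWritable > then maps a vertex key to the keys of its adjacent vertices.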
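
And for the retweet question in comment 22, a rough sketch of a key class that treats (1,2) and (2,1) as the same key. UserPairWritable is a made-up name; the mapper would emit this key together with the retweet count, so the reducer can sum the counts and check whether both directions occurred.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Unordered pair of user ids: (1,2) and (2,1) normalize to the same key,
// so both directions of the relation end up in the same reduce call.
public class UserPairWritable implements WritableComparable<UserPairWritable> {

  private long smaller;
  private long greater;

  public UserPairWritable() {
  }

  public UserPairWritable(long a, long b) {
    set(a, b);
  }

  public void set(long a, long b) {
    this.smaller = Math.min(a, b);
    this.greater = Math.max(a, b);
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(smaller);
    out.writeLong(greater);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    smaller = in.readLong();
    greater = in.readLong();
  }

  @Override
  public int compareTo(UserPairWritable other) {
    if (smaller != other.smaller) {
      return smaller < other.smaller ? -1 : 1;
    }
    if (greater != other.greater) {
      return greater < other.greater ? -1 : 1;
    }
    return 0;
  }

  @Override
  public int hashCode() {
    // used by the default HashPartitioner to group both directions together
    return (int) (31 * smaller + greater);
  }

  @Override
  public boolean equals(Object obj) {
    if (!(obj instanceof UserPairWritable)) {
      return false;
    }
    UserPairWritable other = (UserPairWritable) obj;
    return smaller == other.smaller && greater == other.greater;
  }
}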