Developing a Simple MapReduce Program in Scala

Article link: https://blog.csdn.net/hopeatme/article/details/52655576

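This article shows how to write a MapReduce program in Scala. Java is the more common choice, but Scala's concise syntax makes for a more pleasant programming experience. It covers creating the mapper, reducer, and driver classes, with particular attention to the role of Scala's companion object in implementing the driver. It then notes what to watch for when compiling and running: make sure you use the new API, include scala-library.jar when packaging, and set HADOOP_CLASSPATH.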


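Anyone reading this will probably ask, "Why write MR in Scala? Isn't Java the more natural choice?" My own reason is that Scala code is concise and I enjoy writing it. For other Scala programmers, it may be that all of their projects are in Scala and they are simply more fluent in it. And if those people really have to solve a problem with MR, they surely have good reasons for it.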

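Back to the topic: here is how to get started. Since everyone uses a different IDE and build tool, I won't go into those; use whatever you are familiar with.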

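The example is a simple MR program that merges small files into large ones. It is simple because there is no complicated map or reduce logic. I use it mainly to show the syntax of writing MR in Scala; everything else is the same as in Java. First, you need three classes: a mapper class, a reducer class, and a driver class.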

Here is the Mapper class I wrote:

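import java.io.IOException

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, Mapper}
import org.apache.hadoop.mapreduce.lib.input.FileSplit

class SmallFileMapper extends Mapper[LongWritable, Text, Text, Text] {

  @throws(classOf[IOException])
  @throws(classOf[InterruptedException])
  override def map(key: LongWritable, value: Text, context: Mapper[LongWritable, Text, Text, Text]#Context): Unit = {
    // The input split identifies the file the current record came from.
    val inputSplit: InputSplit = context.getInputSplit
    val filename: String = inputSplit.asInstanceOf[FileSplit].getPath.getName
    val word = new Text()
    word.set(filename)
    // Emit (file name, original record).
    context.write(word, value)
  }

}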

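The mapper above turns each record into a new key/value pair: the key is the file name and the value is the original record. Pay attention to the structure and syntax of a Mapper; when you write your own, fill in the body with your own logic.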
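To make the transformation concrete, here is a small plain-Scala sketch of what the map stage produces. It uses no Hadoop types; `MapStageSketch`, `mapRecord`, and the file names are hypothetical names made up for illustration.

```scala
// Plain-Scala sketch of the map stage; no Hadoop types involved.
object MapStageSketch {

  // Mirrors the idea of SmallFileMapper.map: key = file name, value = the record itself.
  def mapRecord(fileName: String, record: String): (String, String) =
    (fileName, record)

  def main(args: Array[String]): Unit = {
    // Hypothetical input: two small files with a few lines each.
    val files = Seq(
      "a.log" -> Seq("first line", "second line"),
      "b.log" -> Seq("only line")
    )

    // Every line of every file becomes one (fileName, line) pair.
    val mapped = for {
      (name, lines) <- files
      line <- lines
    } yield mapRecord(name, line)

    mapped.foreach { case (k, v) => println(s"$k\t$v") }
  }
}
```

In the real job, Hadoop then groups these pairs by key, so all records of one file reach the same reduce call.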

Next, the reducer class I wrote:

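import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Reducer
import scala.collection.JavaConverters._

class SmallFileReducer extends Reducer[Text, Text, Text, Text] {
  // values must be java.lang.Iterable to match Hadoop's signature; asScala enables foreach.
  override def reduce(key: Text, values: java.lang.Iterable[Text], context: Reducer[Text, Text, Text, Text]#Context): Unit = {
    values.asScala.foreach(context.write(key, _))
  }

}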

This reducer simply writes the map output straight back to the context.
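One Scala-specific detail in a reducer like this: Hadoop hands the values over as a `java.lang.Iterable`, while a bare `Iterable` in Scala means `scala.collection.Iterable` unless the Java one is imported explicitly. The sketch below shows the conversion on its own, without Hadoop; `JavaIterableSketch` and `emitAll` are hypothetical names, and it assumes a Scala version where `scala.collection.JavaConverters` is available (on 2.13 and later it is deprecated in favor of `scala.jdk.CollectionConverters`).

```scala
import java.util.{Arrays => JArrays}
import scala.collection.JavaConverters._

object JavaIterableSketch {

  // Same parameter shape as the `values` argument of a reduce method:
  // a java.lang.Iterable, converted with asScala before using Scala collection methods.
  def emitAll(key: String, values: java.lang.Iterable[String]): List[(String, String)] =
    values.asScala.map(v => (key, v)).toList

  def main(args: Array[String]): Unit = {
    // A java.util.List is one example of a java.lang.Iterable.
    val values: java.lang.Iterable[String] = JArrays.asList("r1", "r2")
    emitAll("a.log", values).foreach(println)
  }
}
```

With real `Text` values it is safer to write each value out while iterating rather than collecting them into a list, because Hadoop reuses the same Writable instance across iterations.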

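Next is the driver class. Before pasting the code, I want to mention Scala's companion object; if you have not met it before, it is worth looking up in the Scala documentation. I implement the driver with a class and an object of the same name: the class extends the required base class and interface, and main in the object serves as the program's entry point. To save space I won't paste the complete code, only the key parts of a driver written in Scala, which you can use as a reference when writing your own: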

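import org.apache.hadoop.conf.{Configuration, Configured}
import org.apache.hadoop.util.{GenericOptionsParser, Tool}

class SmallFileMerger extends Configured with Tool {
  def run(args: Array[String]): Int = {
    val conf: Configuration = getConf
    // Strip Hadoop's generic options (-D, -files, ...) and keep the job's own arguments.
    val args2 = new GenericOptionsParser(conf, args).getRemainingArgs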

SmallFileMerger is the driver class, and it must implement the run method. Below is the code inside run that defines the job.

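// Imports needed at the top of the file:
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

val job: Job = Job.getInstance(conf)
job.setJobName("SmallFilesMerger")
job.setJarByClass(this.getClass)

// Both the mapper and the reducer emit Text keys and Text values.
job.setOutputKeyClass(classOf[Text])
job.setOutputValueClass(classOf[Text])

job.setMapperClass(classOf[SmallFileMapper])
job.setReducerClass(classOf[SmallFileReducer])

// reduceCounts (Int), inputPath and outputPath (org.apache.hadoop.fs.Path)
// are defined in the part of run that is not shown here.
job.setNumReduceTasks(reduceCounts)

FileInputFormat.addInputPath(job, inputPath)
FileOutputFormat.setOutputPath(job, outputPath)
// Compress the job output with gzip.
FileOutputFormat.setCompressOutput(job, true)
FileOutputFormat.setOutputCompressorClass(job, classOf[GzipCodec])

// Tool.run returns the exit code: 0 on success, 1 on failure.
if (job.waitForCompletion(true)) 0 else 1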

The settings above compress the reducer output with gzip.
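The job setup uses `inputPath`, `outputPath`, and `reduceCounts` without showing where they come from. One possible way to derive them from the remaining arguments is sketched below. Everything here is an assumption made for illustration, not the author's code: the `MergeArgs` case class, the `parse` helper, and the argument order `<input> <output> <reducers>` are all hypothetical.

```scala
import scala.util.Try

// Hypothetical holder for the three values used by the job setup.
final case class MergeArgs(inputPath: String, outputPath: String, reduceCounts: Int)

object MergeArgs {

  // Assumed argument layout: <input dir> <output dir> <number of reducers>.
  // Returns Left with a message instead of throwing on bad input.
  def parse(args: Array[String]): Either[String, MergeArgs] = args match {
    case Array(in, out, n) =>
      // Require a positive count: with zero reducers the reduce phase would be skipped.
      Try(n.toInt).toOption.filter(_ > 0) match {
        case Some(count) => Right(MergeArgs(in, out, count))
        case None        => Left(s"reducer count must be a positive integer, got: $n")
      }
    case _ =>
      Left("usage: SmallFileMerger <input> <output> <reducers>")
  }
}
```

Inside run, the two path strings would then be wrapped as `new Path(...)` (from `org.apache.hadoop.fs`) before being passed to `FileInputFormat.addInputPath` and `FileOutputFormat.setOutputPath`, and a `Left` result could be printed before returning a non-zero exit code.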

Below is the companion object. This part is simple: it only needs a main method, as follows:

object SmallFileMerger {
  def main(args : Array[String]) : Unit = {
    val sfm = new SmallFileMerger()
    val exitCode = ToolRunner.run(sfm , args)