Spark Operators: The Shuffle Category

This post digs into the groupByKey operation on Spark RDDs: how it works internally, how data moves from the map side (shuffle write) to the reduce side (shuffle read) where the groups are built, and how the partitioning rule shapes the resulting data distribution. It also touches on the performance gap between groupByKey and aggregation operators such as reduceByKey and aggregateByKey.


Spark provides an implicit conversion that wraps an RDD of key/value pairs into PairRDDFunctions, which is where shuffle-style operators such as groupByKey are defined.
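A minimal sketch of what that conversion does. The explicit `new PairRDDFunctions(...)` wrapping is shown only for illustration; normally the implicit in the RDD companion object (rddToPairRDDFunctions) kicks in automatically, and the object name here is made up:

import org.apache.spark.rdd.{PairRDDFunctions, RDD}
import org.apache.spark.{SparkConf, SparkContext}

object PairImplicitDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PairImplicitDemo").setMaster("local[*]"))
    val wordAndOne: RDD[(String, Int)] = sc.parallelize(Seq(("spark", 1), ("flink", 1), ("spark", 1)))

    // what the compiler effectively inserts: wrap the pair RDD so key/value operators become available
    val wrapped = new PairRDDFunctions(wordAndOne)
    wrapped.groupByKey().collect().foreach(println)

    // with the implicit conversion in scope, this is the usual spelling
    wordAndOne.groupByKey().collect().foreach(println)

    sc.stop()
  }
}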

groupByKey

groupByKey introduces a new stage, and its source code lives in PairRDDFunctions; it no longer just builds another MapPartitionsRDD. The operator is a map-side/reduce-side exchange: once the map side has finished processing, it spills its output to local disk (shuffle write), and the reduce side then pulls that data across the network (shuffle read).
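A quick way to see that stage boundary is to print the lineage. A minimal sketch (object and variable names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleBoundaryDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ShuffleBoundaryDemo").setMaster("local[*]"))
    val grouped = sc.parallelize(Seq(("spark", 1), ("flink", 1), ("spark", 1)), 2).groupByKey(3)

    // the lineage shows a ShuffledRDD sitting on top of its parent RDD,
    // i.e. the boundary between the shuffle-write stage and the shuffle-read stage
    println(grouped.toDebugString)
    sc.stop()
  }
}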

As for why the output shrinks: records in the output files are merged by key, so there are nowhere near as many keys as before, but every value is still present, just collected under its key.

Another question worth discussing is how the partitioning rule is decided. If you do not pass a partitioner, a default one is used: HashPartitioner, the same approach MapReduce takes. The number of partitions comes from the value you pass in (when nothing is passed, Spark falls back to the largest partition count among the upstream RDDs, or spark.default.parallelism if that is set), and each key is assigned to a partition by taking its hashCode modulo that number.

There is also an overload that takes a partitioner directly.
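A small sketch of the three ways to call it (object name is illustrative; the default-partitioner comment reflects my reading of Partitioner.defaultPartitioner):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object GroupByKeyOverloads {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GroupByKeyOverloads").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(("spark", 1), ("flink", 1), ("spark", 1)), 4)

    pairs.groupByKey()                         // default partitioner: HashPartitioner over the largest
                                               // upstream partition count (4 here), unless
                                               // spark.default.parallelism is set explicitly
    pairs.groupByKey(6)                        // shorthand for the overload below
    pairs.groupByKey(new HashPartitioner(6))   // pass the partitioner yourself

    sc.stop()
  }
}

The complete example below uses the numPartitions overload with 6 output partitions: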

package com.doit.spark.restart

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object GroupByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MapPartitionsDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val words = sc.parallelize(List(
      "spark", "flink", "hadoop", "kafka",
      "spark", "hive", "hadoop", "kafka",
      "spark", "flink", "hive", "kafka",
      "spark", "flink", "hadoop", "kafka"
    ), 4)
    val wordAndOne: RDD[(String, Int)] = words.map((_, 1))
    val value: RDD[(String, Iterable[Int])] = wordAndOne.groupByKey(6)
    value.saveAsTextFile("zhao")
    sc.stop()
  }
}


p0  (kafka,CompactBuffer(1, 1, 1, 1))
p1  (hadoop,CompactBuffer(1, 1, 1))
p2  (hive,CompactBuffer(1, 1))
p3  (empty)
p4  (flink,CompactBuffer(1, 1, 1))
p5  (spark,CompactBuffer(1, 1, 1, 1))
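This layout can be checked directly against HashPartitioner: getPartition is just the key's hashCode modulo 6, and on a standard JVM (where String.hashCode is fixed) the five keys land exactly where the output shows them, while no key maps to partition 3. A small verification sketch:

import org.apache.spark.HashPartitioner

object CheckPartitions {
  def main(args: Array[String]): Unit = {
    val partitioner = new HashPartitioner(6)
    // getPartition(key) == nonNegativeMod(key.hashCode, 6)
    Seq("kafka", "hadoop", "hive", "flink", "spark").foreach { key =>
      println(s"$key -> p${partitioner.getPartition(key)}")
    }
    // kafka -> p0, hadoop -> p1, hive -> p2, flink -> p4, spark -> p5; nothing maps to p3
  }
}

The relevant source is quoted below.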
/**
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with into `numPartitions` partitions. The ordering of elements within
   * each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   *
   * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
   * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
   */
  def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(new HashPartitioner(numPartitions))
  }


/**
   * Group the values for each key in the RDD into a single sequence. Allows controlling the
   * partitioning of the resulting key-value pair RDD by passing a Partitioner.
   * The ordering of elements within each group is not guaranteed, and may even differ
   * each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   *
   * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
   * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
   */
  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }


/**
 * A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
 * Java's `Object.hashCode`.
 *
 * Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
 * so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
 * produce an unexpected or incorrect result.
 */
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}
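
Since groupByKey accepts any Partitioner, you are not limited to HashPartitioner. A minimal sketch with a made-up partitioner (class name and rule are purely illustrative):

import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// hypothetical partitioner: keys starting with "s" go to partition 0, everything else to partition 1
class FirstLetterPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (key.toString.startsWith("s")) 0 else 1
}

object CustomPartitionerDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CustomPartitionerDemo").setMaster("local[*]"))
    val wordAndOne = sc.parallelize(Seq(("spark", 1), ("flink", 1), ("spark", 1)), 4)

    wordAndOne.groupByKey(new FirstLetterPartitioner).collect().foreach(println)
    sc.stop()
  }
}

Back in groupByKey itself, these are the three functions it hands to combineByKeyWithClassTag: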

    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2

The implementation boils down to building a ShuffledRDD via combineByKeyWithClassTag: createCombiner wraps the first value seen for a key in a CompactBuffer(v), mergeValue appends further values of the same key within a partition to that buffer, and mergeCombiners concatenates buffers coming from different tasks on the reduce side. Note that mapSideCombine = false, so, as the comment in the source above explains, groupByKey does no local aggregation on the map side; all of the grouping happens after the shuffle.
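CompactBuffer is private to Spark, but the same shape can be reproduced with the public combineByKey and an ArrayBuffer. This is only a rough sketch of the mechanism (unlike the real groupByKey it leaves map-side combine enabled), followed by the cheaper reduceByKey alternative the scaladoc recommends when all you need is a per-key aggregate:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.{SparkConf, SparkContext}

object GroupByKeyMechanism {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GroupByKeyMechanism").setMaster("local[*]"))
    val wordAndOne = sc.parallelize(Seq(("spark", 1), ("flink", 1), ("spark", 1)), 4)

    // same three functions as above, with ArrayBuffer standing in for CompactBuffer
    val grouped = wordAndOne.combineByKey(
      (v: Int) => ArrayBuffer(v),                                  // createCombiner: first value of a key
      (buf: ArrayBuffer[Int], v: Int) => buf += v,                 // mergeValue: later values of the same key
      (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => b1 ++= b2    // mergeCombiners: buffers from different tasks
    )
    grouped.collect().foreach(println)

    // if the goal is a per-key aggregate, reduceByKey is much cheaper: it combines on the map side,
    // so far less data crosses the network during the shuffle
    wordAndOne.reduceByKey(_ + _).collect().foreach(println)

    sc.stop()
  }
}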

groupBy

groupBy takes two parameters: a grouping function and a partitioner. It maps each element t to (f(t), t), i.e. (result of applying the function, the untouched element), and then calls groupByKey, so each group ends up as (key, CompactBuffer(original elements)).

def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
      : RDD[(K, Iterable[T])] = withScope {
    val cleanF = sc.clean(f)
    this.map(t => (cleanF(t), t)).groupByKey(p)
  }
package com.doit.spark.restart

import org.apache.spark.{SparkConf, SparkContext}

object GroupBy {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MapPartitionsDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val words = sc.parallelize(List(
      "spark", "flink", "hadoop", "kafka",
      "spark", "hive", "hadoop", "kafka",
      "spark", "flink", "hive", "kafka",
      "spark", "flink", "hadoop", "kafka"
    ), 4)
    val wordAndOne = words.map((_, 1))
    val grouped = wordAndOne.groupBy(_._1)
    // implement groupBy by calling groupByKey directly
    wordAndOne.map(x => (x._1, x)).groupByKey()
    sc.stop()
  }
}

 

distinct

This method removes duplicates: List(1, 1, 2, 2, 2, 3, 4, 5, 4, 3, 2, 4, 2, 5) => (1, 2, 3, 4, 5)

      case _ => map(x => (x, null)).reduceByKey((x, _) => x, numPartitions).map(_._1)

This is the core of the implementation: the data is first turned into pairs (x, null), then reduceByKey is called with a function that simply keeps the first value it sees for each key and does nothing else. The usual function passed to reduceByKey is something like _ + _, which gets applied twice, once on the map side and once on the reduce side, to add values up; here only a single null survives per key and nothing is added. Finally map(_._1) takes the key out of each remaining pair, completing the deduplication.
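A small sketch that runs both the built-in distinct and the hand-written equivalent of the source line above (object name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object DistinctDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DistinctDemo").setMaster("local[*]"))
    val nums = sc.parallelize(List(1, 1, 2, 2, 2, 3, 4, 5, 4, 3, 2, 4, 2, 5), 4)

    // built-in operator
    println(nums.distinct().collect().toList.sorted)      // List(1, 2, 3, 4, 5)

    // the same idea written out by hand, mirroring the source line above
    val manual = nums.map(x => (x, null))                  // (1,null), (1,null), (2,null), ...
      .reduceByKey((x, _) => x)                            // keep one value per key, drop the rest
      .map(_._1)                                           // take the keys back out
    println(manual.collect().toList.sorted)                // List(1, 2, 3, 4, 5)

    sc.stop()
  }
}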
 
