Commonly Used RDD Operators

License: CC 4.0 BY-SA

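This article covers the basic RDD operators in Apache Spark, which fall into two groups, Transformations and Actions. Concrete examples show what each operator does and where it is useful, to help readers better understand how Spark executes a job.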


1 Operator categories

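RDD operators fall into two categories: Transformation and Action.
Transformation: creates a new dataset from an existing one and returns a new RDD. For example, map passes each element through a function and returns a new distributed dataset.

Action: runs a computation on the dataset and returns a value to the driver program. For example, collect gathers all elements of the dataset and returns them to the driver.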

2 Characteristics of Transformations

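All transformations on an RDD are lazy: they do not compute their results right away. Instead, they only remember the transformations applied to the base dataset (for example, a file). The transformations actually run only when an action requires a result to be returned to the driver. In other words, computation happens when an action is encountered and not before. This design lets Spark run more efficiently.
Here is a question to think about: why does this approach make execution more efficient?
(The persist (or cache) method will be covered later.)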
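
One way to observe this laziness is to count how often the function passed to map actually runs. The sketch below is illustrative only: the accumulator name `touched`, the app name, and the `local[2]` master are assumptions, and `SparkContext.getOrCreate` simply reuses the `sc` that spark-shell already provides. It also hints at an answer to the question above: since nothing runs until an action, Spark can see the whole chain of transformations and pipeline them in a single pass instead of materializing every intermediate RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("lazy-demo"))

// Hypothetical counter used only to reveal when the map function runs.
val touched = sc.longAccumulator("touched")

val doubled = sc.parallelize(1 to 9).map { x =>
  touched.add(1) // side effect: one increment per processed element
  x * 2
}

val before = touched.value.longValue() // the map has only been recorded so far
val result = doubled.collect()         // the action triggers the computation
val after = touched.value.longValue()

println(s"before=$before after=$after result=${result.mkString(", ")}")
```

On a fresh local context the counter should still be 0 before collect and 9 afterwards, one increment per element.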

3 A closer look at Transformations and Actions

The following example illustrates Transformations, Actions, and what lazy means:

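First create an RDD and check the web UI. (Strictly speaking, parallelize creates an RDD rather than transforming one, but like a transformation it does not launch a job.)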
scala> val a=sc.parallelize(1 to 9)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

Applying map produces another new RDD. Check the UI again: nothing has changed.
scala> val b=a.map(x =>x*2)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:26

Now run an action and check the UI again to see what has changed:
scala> b.collect
 Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)

4 Commonly used operators

map, filter, collect
Once you have some experience, chaining the calls is recommended:

scala> val a = sc.parallelize(1 to 10).map(x=>(x,1)).filter(x=>(x._1) > 5 ).collect
a: Array[(Int, Int)] = Array((6,1), (7,1), (8,1), (9,1), (10,1))

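The operation above is simple: `1 to 10` creates the ten numbers from 1 to 10, then map is applied, followed by filter. Let's break it into steps to see what each one does: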

1. Create an RDD with parallelize
scala> val a = sc.parallelize(1 to 10)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24

2. Apply map to turn each element into an (element, 1) tuple
scala> val map = a.map((_,1))
map: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[9] at map at <console>:26

3. Apply filter
scala> val filterRdd = map.filter(x=>x._1 > 5)
filterRdd: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[10] at filter at <console>:28

4. Collect and print the result
scala> val res = filterRdd.collect
res: Array[(Int, Int)] = Array((6,1), (7,1), (8,1), (9,1), (10,1))

flatMap vs. map

Create input.txt

[hadoop@hadoop data]$ cat input.txt 
hello java
hello hadoop
hello hive
hello sqoop
hello hdfs
hello spark

Apply flatMap

1. Read input.txt
scala> val rdd = sc.textFile("file:///home/hadoop/data/input.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/data/input.txt MapPartitionsRDD[12] at textFile at <console>:24

2. Apply flatMap
scala> val flatMapRdd = rdd.flatMap(x=>x.split(" "))
flatMapRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at flatMap at <console>:26
The output shows that the result is of type RDD[String]. Let's collect it to see the contents:

3. Run collect
scala> flatMapRdd.collect
res1: Array[String] = Array(hello, java, hello, hadoop, hello, hive, hello, sqoop, hello, hdfs, hello, spark)

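The result is an Array[String]: flatMap flattened the data, merging the words from all lines into a single array.
In other words, map produces one array per input line, whereas flatMap puts all the resulting elements into a single collection. That is what flattening means.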

Apply map

scala> val rdd = sc.textFile("file:///home/hadoop/data/input.txt")
scala> rdd.map(x=>x.split(" ")).collect
res28: Array[Array[String]] = Array(Array(hello, java), Array(hello, hadoop), Array(hello, hive), Array(hello, sqoop), Array(hello, hdfs), Array(hello, spark))

map should be familiar by now; here it yields an Array[Array[String]].
The contrast is clear: after map, the outer array holds several inner arrays, one per line of input.
No flattening takes place.
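
The same comparison can be reproduced without a file on disk. The sketch below uses a small in-memory collection as a stand-in for input.txt; the sample lines, the app name, and the `local[2]` master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("flatmap-vs-map"))

// In-memory stand-in for the lines of input.txt.
val lines = sc.parallelize(Seq("hello java", "hello hadoop", "hello spark"))

// map: exactly one output element (an Array[String]) per input line.
val nested: Array[Array[String]] = lines.map(_.split(" ")).collect()

// flatMap: the per-line arrays are concatenated into one flat sequence of words.
val flat: Array[String] = lines.flatMap(_.split(" ")).collect()

println(s"map produced ${nested.length} arrays, flatMap produced ${flat.length} words")
```

Flattening the result of map on the driver (`nested.flatten`) gives the same words as flatMap, which is a handy way to remember the relationship between the two.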

mapValues

scala> val a = sc.parallelize(List("zhangsan","lisi","wangwu","zhaoliu"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[17] at parallelize at <console>:24

scala> val mapRdd = a.map(x=>(x.length,x))
mapRdd: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[18] at map at <console>:26

scala> mapRdd.collect
res5: Array[(Int, String)] = Array((8,zhangsan), (4,lisi), (6,wangwu), (7,zhaoliu))

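Up to this point things should be clear: map turns each string into a pair whose key is the string's length and whose value is the original string.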

Now apply mapValues. You can probably guess what this operator does; here is the result:
scala> mapRdd.mapValues("hello" +" " +_).collect
res7: Array[(Int, String)] = Array((8,hello zhangsan), (4,hello lisi), (6,hello wangwu), (7,hello zhaoliu))

Clearly, mapValues transforms only the value and leaves the key untouched.
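
Because the keys cannot change, mapValues can also keep the partitioner of its parent RDD, whereas a plain map over the pairs discards it. The sketch below is one way to check this; the `HashPartitioner(2)`, the app name, and the `local[2]` master are assumptions chosen for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("mapvalues-demo"))

// Same (length, name) pairs as above, explicitly hash-partitioned by key.
val pairs = sc.parallelize(Seq("zhangsan", "lisi", "wangwu", "zhaoliu"))
  .map(x => (x.length, x))
  .partitionBy(new HashPartitioner(2))

// mapValues only touches the value, so the partitioner is retained.
val viaMapValues = pairs.mapValues("hello " + _)

// map could change the key, so Spark drops the partitioner.
val viaMap = pairs.map { case (k, v) => (k, "hello " + v) }

println(viaMapValues.partitioner) // same partitioner as `pairs`
println(viaMap.partitioner)       // None
```

This matters when a later step such as join or reduceByKey can reuse the existing partitioning and avoid another shuffle.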

join

scala> val a = sc.parallelize(Array(("A","a1"),("B","b1"),("C","c1"),("d","d1")))
a: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[37] at parallelize at <console>:26

scala> val b = sc.parallelize(Array(("A","a1"),("B","b2"),("C","c1"),("C","c2")))
b: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[38] at parallelize at <console>:26

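1. First, the result of join
scala> a.join(b).collect
res23: Array[(String, (String, String))] = Array((B,(b1,b2)), (A,(a1,a1)), (C,(c1,c1)), (C,(c1,c2)))
As the result shows, this behaves like an ordinary inner join: only keys present on both sides are matched.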

2.rightOuterJoin
scala> a.rightOuterJoin(b).collect
res25: Array[(String, (Option[String], String))] = Array((B,(Some(b1),b2)), (A,(Some(a1),a1)), (C,(Some(c1),c1)), (C,(Some(c1),c2)))
The right side (b) drives the match: every key in b is kept.

3.leftOuterJoin
scala> a.leftOuterJoin(b).collect
res26: Array[(String, (String, Option[String]))] = Array((d,(d1,None)), (B,(b1,Some(b2))), (A,(a1,Some(a1))), (C,(c1,Some(c1))), (C,(c1,Some(c2))))
The left side drives the match; keys with no match on the right show None.

4.fullOuterJoin
scala> a.fullOuterJoin(b).collect
res27: Array[(String, (Option[String], Option[String]))] = Array((d,(Some(d1),None)), (B,(Some(b1),Some(b2))), (A,(Some(a1),Some(a1))), (C,(Some(c1),Some(c1))), (C,(Some(c1),Some(c2))))
Full outer join: keys from both sides are kept, and a missing side shows None.

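Summary: these are the usual join operations, and the results speak for themselves. Do pay attention to the data structure of the output, though. It matters a great deal for later steps, because you can only process the result properly once you know its structure.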
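
To make the point about output structure concrete, the sketch below unpacks the `(String, Option[String])` values produced by leftOuterJoin. The `"N/A"` placeholder, the app name, and the `local[2]` master are assumptions for illustration, and the results are sorted on the driver only because the order returned by a join is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("join-structure"))

val a = sc.parallelize(Seq(("A", "a1"), ("B", "b1"), ("C", "c1"), ("d", "d1")))
val b = sc.parallelize(Seq(("A", "a1"), ("B", "b2"), ("C", "c1"), ("C", "c2")))

// leftOuterJoin yields (key, (leftValue, Option[rightValue])).
val joined = a.leftOuterJoin(b)

// Replace a missing right side with a placeholder value.
val filled = joined
  .mapValues { case (left, right) => (left, right.getOrElse("N/A")) }
  .collect()
  .sorted

// Keys that exist only on the left side.
val leftOnly = joined.filter { case (_, (_, right)) => right.isEmpty }.keys.collect()

filled.foreach(println)
println(leftOnly.mkString(", "))
```

Pattern matching on the nested tuple keeps the code readable, which is where knowing the result type in advance pays off.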

subtract
Let's look at the source code to see what this operator does:
Source: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   *
   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
   * RDD will be <= us.
   */
This comment is straightforward: subtract returns the elements of A that are not in B.
scala> val a = sc.parallelize(1 to 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val b = sc.parallelize(Array(1,3))
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> a.subtract(b).collect
res1: Array[Int] = Array(2, 4, 5)

intersection

 /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   */
  def intersection(other: RDD[T]): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

This is simply the intersection of the two collections:


scala> a.intersection(b).collect
res3: Array[Int] = Array(1, 3)

cartesian

  /**
   * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
   * elements (a, b) where a is in `this` and b is in `other`.
   */
  def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    new CartesianRDD(sc, this, other)
  }

Computes the Cartesian product:

scala> a.cartesian(b).collect
res4: Array[(Int, Int)] = Array((1,1), (2,1), (1,3), (2,3), (3,1), (4,1), (5,1), (3,3), (4,3), (5,3))
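
The three operators above treat duplicates differently, which is easy to miss when the inputs contain none. The sketch below uses inputs with repeated elements; the sample data, the app name, and the `local[2]` master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("set-like-operators"))

val left = sc.parallelize(Seq(1, 2, 2, 3, 3))
val right = sc.parallelize(Seq(3, 3, 4))

// subtract drops every element that also occurs in `right`,
// but keeps duplicates of the elements that remain.
val difference = left.subtract(right).collect().sorted

// intersection removes duplicates, as its doc comment states.
val common = left.intersection(right).collect().sorted

// cartesian pairs every element of `left` with every element of `right`.
val pairCount = left.cartesian(right).count()

println(difference.mkString(", "))
println(common.mkString(", "))
println(pairCount)
```

The results are sorted on the driver because subtract and intersection shuffle the data and do not promise any particular order.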

Common actions

reduce

scala> val a = sc.parallelize(1 to 10)
scala> a.reduce((x,y)=>(x+y))
res8: Int = 55

As the result shows, reduce here adds up all the elements. There is also a shorter form:
scala> a.reduce(_+_)

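scala> sc.parallelize(1 to 10).reduce(_-_)
res21: Int = -15
Be careful with operators such as subtraction: reduce expects a commutative and associative function, so the result of reduce(_-_) depends on how the data is partitioned and may differ between runs.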
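
The following sketch shows why the function passed to reduce should be commutative and associative. It pins the data to a single partition so the subtraction becomes a plain left-to-right fold, and compares that with addition across several partition counts. The partition counts, the app name, and the `local[2]` master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("reduce-order"))

// With one partition, reduce evaluates ((1 - 2) - 3) - ... - 10 from left to right.
val sequential = sc.parallelize(1 to 10, numSlices = 1).reduce(_ - _)

// Addition is commutative and associative, so the partition count does not matter.
val sums = (1 to 4).map(n => sc.parallelize(1 to 10, numSlices = n).reduce(_ + _))

println(s"single-partition subtraction: $sequential")
println(s"sums for 1 to 4 partitions: ${sums.mkString(", ")}")
```

With more than one partition, each partition is reduced on its own and the partial results are then combined, so a value such as -15 reflects one particular split of the data rather than a property of the numbers themselves.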

sum

scala> a.sum
res10: Double = 55.0
The result is a Double by default; it can of course be converted to an Int:
scala> a.sum.toInt
res12: Int = 55

first, take

first returns the first element:
scala> a.first
res13: Int = 1

take returns the first n elements:
scala> a.take(1)
res15: Array[Int] = Array(1)

Note the difference between the two results: first returns a single element, while take returns an array.

top

scala> val a = sc.parallelize(Array(3,7,2,10,5,1))
scala> a.top(3)
res17: Array[Int] = Array(10, 7, 5)
top returns the three largest numbers, sorting them internally. What about strings?


scala> val a = sc.parallelize(List("zhangsan","lisi","wangwu","zhaoliu")).top(3)
a: Array[String] = Array(zhaoliu, zhangsan, wangwu)

scala> val a = sc.parallelize(List("a","bb","ccc","dddd")).top(3)
a: Array[String] = Array(dddd, ccc, bb)

scala> val a = sc.parallelize(List("a","b","c","d")).top(3)
a: Array[String] = Array(d, c, b)

scala> val a = sc.parallelize(List("a","aa","aaa","aaaa")).top(3)
a: Array[String] = Array(aaaa, aaa, aa)

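As the examples above show, numbers are sorted in descending order and the top n are returned; strings are compared lexicographically, again largest first.
You may ask: what if I want ascending order instead?

This is where Scala implicits come in (here an implicit Ordering value rather than an implicit conversion); they are used in many places in Scala: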

scala> implicit val myOrder = implicitly[Ordering[Int]].reverse
myOrder: scala.math.Ordering[Int] = scala.math.Ordering$$anon$4@774d07de

scala> sc.parallelize(Array(3,5,10,7,1,6)).top(3)
res20: Array[Int] = Array(1, 3, 5)
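
An implicit Ordering defined in the session affects every later call that resolves an Ordering[Int], so it can be less surprising to pass the ordering explicitly or to use takeOrdered. The sketch below shows both alternatives; the app name and the `local[2]` master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("top-ordering"))

val nums = sc.parallelize(Array(3, 5, 10, 7, 1, 6))

// Orderings are passed explicitly so that an implicit defined earlier
// in the session cannot change the outcome.
val largest = nums.top(3)(Ordering.Int)                  // three largest, descending
val smallest = nums.takeOrdered(3)(Ordering.Int)         // three smallest, ascending
val viaReversedTop = nums.top(3)(Ordering.Int.reverse)   // same as takeOrdered(3)

println(largest.mkString(", "))
println(smallest.mkString(", "))
println(viaReversedTop.mkString(", "))
```

Both top and takeOrdered collect their result on the driver, so they are intended for small values of n.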

5 Common Transformations

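Method	Meaning
map(func)	Returns a new RDD formed by passing each input element through func
filter(func)	Returns a new RDD made up of the input elements for which func returns true
flatMap(func)	Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element)
mapPartitions(func)	Similar to map, but runs separately on each partition of the RDD, so on an RDD of type T, func must be of type Iterator[T] => Iterator[U]
mapPartitionsWithIndex(func)	Similar to mapPartitions, but func also takes an integer giving the partition index, so on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U]
sample(withReplacement, fraction, seed)	Samples the given fraction of the data, with or without replacement; seed sets the random number generator seed
union(otherDataset)	Returns a new RDD containing the union of the source RDD and the argument RDD
intersection(otherDataset)	Returns a new RDD containing the intersection of the source RDD and the argument RDD
distinct([numTasks])	Returns a new RDD with the duplicates of the source RDD removed
groupByKey([numTasks])	Called on an RDD of (K,V) pairs, returns an RDD of (K, Iterable[V]) pairs
reduceByKey(func, [numTasks])	Called on an RDD of (K,V) pairs, returns an RDD of (K,V) pairs in which the values for each key are aggregated with the given reduce function; as with groupByKey, the number of reduce tasks can be set through the optional second argument
sortByKey([ascending], [numTasks])	Called on an RDD of (K,V) pairs where K is orderable (an implicit Ordering[K] is required), returns an RDD of (K,V) pairs sorted by key
sortBy(func,[ascending], [numTasks])	Similar to sortByKey, but more flexible: sorts by the key that func computes for each element
join(otherDataset, [numTasks])	Called on RDDs of type (K,V) and (K,W), returns an RDD of (K,(V,W)) pairs with all pairs of elements for each key
repartition(numPartitions)	Repartitions the RDD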
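
Several of the transformations in the table are commonly combined into a word count. The sketch below is one such combination, not the only way to write it; the sample lines, the app name, and the `local[2]` master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("wordcount-sketch"))

val lines = sc.parallelize(Seq("hello java", "hello hadoop", "hello spark", "spark sql"))
val words = lines.flatMap(_.split(" "))

// reduceByKey sums the 1s for each word.
val counts = words.map((_, 1)).reduceByKey(_ + _)

// groupByKey followed by a sum gives the same totals, but gathers all values per key first.
val viaGroup = words.map((_, 1)).groupByKey().mapValues(_.sum)

// sortBy with a computed key: highest count first, ties broken alphabetically.
val ranked = counts.sortBy(pair => (-pair._2, pair._1)).collect()

val distinctWords = words.distinct().count()

ranked.foreach(println)
println(distinctWords)
```

For simple aggregations like this one, reduceByKey is usually preferred over groupByKey because it can combine values within each partition before the shuffle.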

6 Common Actions

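Action	Meaning
reduce(func)	Aggregates all the elements of the RDD using func, which must be commutative and associative
collect()	Returns all the elements of the dataset as an array to the driver program
count()	Returns the number of elements in the RDD
first()	Returns the first element of the RDD (similar to take(1))
take(n)	Returns an array with the first n elements of the dataset
takeSample(withReplacement,num, [seed])	Returns an array with a random sample of num elements of the dataset, with or without replacement; seed optionally sets the random number generator seed
takeOrdered(n, [ordering])	Returns the first n elements using either their natural order or a custom ordering
saveAsTextFile(path)	Writes the elements of the dataset as text files to HDFS or another supported file system; Spark calls toString on each element to convert it to a line of text in the file
saveAsSequenceFile(path)	Writes the elements of the dataset as a Hadoop SequenceFile in the given directory, on HDFS or any other Hadoop-supported file system
saveAsObjectFile(path)	Writes the elements of the dataset to the given directory using Java serialization
countByKey()	For RDDs of type (K,V), returns a map of (K, Long) pairs with the number of elements for each key
foreach(func)	Runs func on each element of the dataset, typically for its side effects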
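
A few of the actions from the table can be tried on a small pair RDD. The sketch below is illustrative: the sample pairs, the seed value 42, the app name, and the `local[2]` master are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("actions-sketch"))

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5)))

val total = pairs.count()                     // number of elements, as a Long
val head = pairs.first()                      // first element of the RDD
val byKey = pairs.countByKey()                // Map[String, Long] with counts per key
val smallest = pairs.values.takeOrdered(2)    // two smallest values, ascending
val sample = pairs.keys.takeSample(withReplacement = false, num = 2, seed = 42L)

println(s"total=$total head=$head byKey=$byKey")
println(s"smallest=${smallest.mkString(", ")} sampleSize=${sample.length}")
```

All of these return plain Scala values to the driver, so they are best used on results that are small enough to fit in driver memory.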