Initialize a SparkSession; under the hood this initializes a SparkContext, and it serves as the entry point of a Spark job.
val sparkSession = SparkSession.builder().appName("sparktest").master("local[2]").getOrCreate()
val sparkContext = sparkSession.sparkContext
val log = sparkContext.textFile("/Users/mike/Desktop/test.txt")
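The examples below assume test.txt holds one space-separated name/age pair per line; the sample contents here are an illustration, not from the source:

```
mike 30
lucy 25
```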
RDD -> DataFrame
- Via a case class (reflection-based schema)
case class Person(name:String,age:Int)
// convert each line to a Person, yielding RDD[Person]
val personRDD = log.map(_.split(" ")).map(x => Person(x(0), x(1).toInt))
// toDF() requires the session implicits
import sparkSession.implicits._
val df = personRDD.toDF()
df.show()
sparkSession.stop()
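Once converted, the DataFrame can also be queried with SQL by registering it as a temporary view. A minimal sketch (the view name people is illustrative):

```scala
// register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
// run a SQL query against it; the result is itself a DataFrame
val adults = sparkSession.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
```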
- Via an explicit schema (StructType)
val schemaFields = "name,age"
val schemaString = schemaFields.split(",")
// StructType/StructField and the column types live in org.apache.spark.sql.types
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row
val schema = StructType(
  Array(
    StructField(schemaString(0), StringType, nullable = true),
    StructField(schemaString(1), IntegerType, nullable = true)
  )
)
// build an RDD[Row] whose rows match the schema
val rowRDD = log.map(_.split(" ")).map(x => Row(x(0), x(1).toInt))
// construct the DataFrame from the RDD[Row] and the schema
val df = sparkSession.createDataFrame(rowRDD, schema)
df.show()
sparkSession.stop()
RDD -> Dataset
The toDS() method converts an RDD to a Dataset (it requires import sparkSession.implicits._ to be in scope).
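The conversion can be sketched as follows, reusing the Person case class from above (a sketch; assumes the same log RDD):

```scala
// brings toDS() and the Encoder[Person] into scope
import sparkSession.implicits._

val personRDD = log.map(_.split(" ")).map(x => Person(x(0), x(1).toInt))
val ds = personRDD.toDS()  // Dataset[Person]
ds.show()
```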
DataFrame <-> Dataset
- toDF() converts a Dataset to a DataFrame
- df.as[Person] converts a DataFrame to a Dataset[Person] (needs an Encoder, provided by sparkSession.implicits._)
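A sketch of round-tripping between the two, assuming the Person case class and the session implicits are in scope:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

val ds: Dataset[Person] = df.as[Person]  // DataFrame -> Dataset[Person]
val df2: DataFrame = ds.toDF()           // Dataset -> DataFrame; columns keep the case-class field names
```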
xxx -> RDD
Calling .rdd on a DataFrame or Dataset converts it back to an RDD (in Scala, rdd is a field, not a method call).
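For example (a sketch; df and ds are the DataFrame and Dataset built earlier):

```scala
val rddFromDF = df.rdd  // RDD[Row]
val rddFromDS = ds.rdd  // RDD[Person], the typed elements are preserved
```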