Flink: A Summary of Common Sources and Sinks
Basic structure
import org.apache.flink.streaming.api.scala._

object Source2Sink {
  def main(args: Array[String]): Unit = {
    // Get the stream execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Set the parallelism; by default it equals the number of CPU threads of the machine
    env.setParallelism(1)

    /*
    ##############
    Read data from a Source
    ##############
    */

    /*
    #############
    Transform the data
    #############
    */

    /*
    ############
    Write the data to a Sink
    ############
    */

    // Execute the job
    env.execute()
  }
}
Source
Where does the data come from?
Files, sockets, local collections, a Kafka source, and a custom MySQL source
Local (collection) sources
val localCollectionSource: DataStream[String] = env.fromCollection(List("aa bb", "bb cc", "cc dd", "aa aa"))
val localSequenceSource: DataStream[Long] = env.fromSequence(1, 100)
val localElementsSource: DataStream[String] = env.fromElements("aa bb cc dd ee", "aa bb cc dd ee", "aa bb cc")
Socket source
val socketSource: DataStream[String] = env.socketTextStream("master", 6666)
File source
val FileSource: DataStream[String] = env.readTextFile("src/main/resources/wc.txt")
Kafka-Source
val props = new Properties()
props.setProperty("bootstrap.servers","master:9092")
props.setProperty("group.id","wanKafkaSourceTest")
val kafkaSource: DataStream[String] = env.addSource(new FlinkKafkaConsumer[String]("test", new SimpleStringSchema(), props))
Custom JDBC Source (MySQL)
val jdbc_Source: DataStream[user] = env.addSource(new jdbcSource)

// Custom source class extending RichParallelSourceFunction
class jdbcSource extends RichParallelSourceFunction[user] {
  var conn: Connection = _
  var statement: PreparedStatement = _
  // Cancellation flag, set to false in cancel()
  var flag: Boolean = true

  // Open the JDBC connection and prepare the query once, before run() is called
  override def open(parameters: Configuration): Unit = {
    conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test?characterEncoding=utf-8&serverTimezone=UTC&useSSL=false", "root", "123456")
    statement = conn.prepareStatement("select * from user")
  }

  // Emit one record per row of the result set
  override def run(ctx: SourceFunction.SourceContext[user]): Unit = {
    val resultSet: ResultSet = statement.executeQuery()
    while (flag && resultSet.next()) {
      ctx.collect(user(resultSet.getInt(1), resultSet.getString(2)))
    }
  }

  override def cancel(): Unit = {
    flag = false
  }

  // Release JDBC resources
  override def close(): Unit = {
    if (statement != null) statement.close()
    if (conn != null) conn.close()
  }
}
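The JDBC source above and the JDBC sink further below both rely on a user case class that is not shown in the snippets. A minimal sketch of what it is assumed to look like, with field names and types matching the two-column user table used in the queries:
// Assumed data class mapping one row of the MySQL user table (id, name)
case class user(id: Int, name: String)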
Data processing
// Transformation: word count
val result: DataStream[(String, Int)] = FileSource
  .flatMap(_.split(" "))
  .map((_, 1))
  .keyBy(_._1)
  .sum(1)
SingleDataStream
map, filter, flatMap, keyBy, reduce, aggregations, …
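The word-count example above already covers map, flatMap, keyBy and the sum aggregation. As a sketch of reduce, the same count can be written with an explicit reduce function (the variable name reduced is only illustrative):
// Word count via reduce: the function receives the previously aggregated
// tuple and the incoming one, and returns their combination
val reduced: DataStream[(String, Int)] = FileSource
  .flatMap(_.split(" "))
  .map((_, 1))
  .keyBy(_._1)
  .reduce((a, b) => (a._1, a._2 + b._2))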
MultiDataStream
-
union
Merges two or more DataStreams into one (a union/connect sketch follows this list)
All input streams must have the same element type
-
connect
After two streams are connected they are merely placed together in one ConnectedStreams
Internally each keeps its own data and type unchanged; the two streams stay independent of each other
-
coMap, coFlatMap
Apply map and flatMap to a ConnectedStreams (one function per input stream)
-
Split + Select (in earlier versions) => side outputs
A side output is marked by defining an OutputTag
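A minimal sketch of union, connect and coMap, assuming two small example streams built with fromElements (the stream names and literal values are only illustrative):
// union: both inputs must carry the same type; the result is a single DataStream
val s1: DataStream[String] = env.fromElements("aa", "bb")
val s2: DataStream[String] = env.fromElements("cc", "dd")
val unioned: DataStream[String] = s1.union(s2)
// connect: the inputs may have different types and stay independent inside ConnectedStreams
val nums: DataStream[Int] = env.fromElements(1, 2, 3)
val connected: ConnectedStreams[String, Int] = s1.connect(nums)
// coMap: one map function per input stream, both producing a common output type
val merged: DataStream[String] = connected.map(str => s"str:$str", num => s"num:$num")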
Side outputs
def main(args: Array[String]): Unit = {
  val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(1)
  // Define the side-output tag
  val oddOutputTag = new OutputTag[Int]("odd")
  val result: DataStream[Int] = env.fromElements(1, 2, 3, 4, 5, 6, 7, 8, 9)
    // Route each record either to the main stream or to the side output
    .process(new ProcessFunction[Int, Int] {
      override def processElement(value: Int, ctx: ProcessFunction[Int, Int]#Context, out: Collector[Int]): Unit = {
        if (value % 2 == 0) {
          // Emit to the main stream
          out.collect(value)
        } else {
          // Emit to the side output
          ctx.output(oddOutputTag, value)
        }
      }
    })
  // Main stream
  result.print("even")
  // Retrieve the side output from the main stream
  result.getSideOutput(oddOutputTag).print("odd")
  env.execute()
}
Partitioning operators
-
Random partitioning
dataStream.shuffle
-
Round-robin partitioning
dataStream.rebalance
-
Rescale partitioning (round-robin within local groups of subtasks)
dataStream.rescale
-
Send all data to a single downstream partition
dataStream.global
-
Custom partitioning
Implement the Partitioner interface and its partition method
Once the custom partitioner is defined, apply it with partitionCustom and specify the field the partitioner keys on (a sketch follows this list)
dataStream.partitionCustom(customPartitioner, "field_name")
dataStream.partitionCustom(customPartitioner, 0)
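A minimal sketch of a custom partitioner, assuming it keys the (word, count) stream from the word-count example by the word (the class name customPartitioner and the hashing scheme are only illustrative):
import org.apache.flink.api.common.functions.Partitioner
// Route each key to a partition derived from its hash
class customPartitioner extends Partitioner[String] {
  override def partition(key: String, numPartitions: Int): Int = {
    math.abs(key.hashCode) % numPartitions
  }
}
// Apply the partitioner and tell Flink which field carries the key
val partitioned: DataStream[(String, Int)] = result.partitionCustom(new customPartitioner, _._1)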
UDF
Function
Flink provides abstract classes and interfaces through which UDFs are implemented for its operators: MapFunction, FilterFunction, ProcessFunction, and so on.
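A minimal sketch of one of these interfaces, a FilterFunction applied to the collection source from above (the predicate itself is only illustrative):
import org.apache.flink.api.common.functions.FilterFunction
// Keep only elements that start with "a"
val filtered: DataStream[String] = localCollectionSource.filter(new FilterFunction[String] {
  override def filter(value: String): Boolean = value.startsWith("a")
})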
RichFunction
-
Provides access to the runtime context
getRuntimeContext() returns the runtime context, e.g. the parallelism the function runs with, the task name, and state
setRuntimeContext() sets the runtime context
-
Has lifecycle methods
open() is the initialization method, called before the operator processes any data
close() is the last lifecycle method to be called, used for cleanup work
e.g. RichMapFunction, RichFlatMapFunction
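A minimal sketch of a RichMapFunction that uses the lifecycle methods and the runtime context, applied to the collection source from above (the println messages and the upper-casing are only illustrative):
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
class upperCaseMap extends RichMapFunction[String, String] {
  // Called once per parallel instance, before any element is processed
  override def open(parameters: Configuration): Unit = {
    println(s"open on subtask ${getRuntimeContext.getIndexOfThisSubtask} of task ${getRuntimeContext.getTaskName}")
  }
  override def map(value: String): String = value.toUpperCase
  // Called once at the end of the lifecycle, for cleanup work
  override def close(): Unit = {
    println("close")
  }
}
val upper: DataStream[String] = localCollectionSource.map(new upperCaseMap)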
Sink
Where does the processed data go?
Files, sockets, a Kafka sink, and a custom MySQL sink
Local sink (print)
result.print()
localSequenceSource.print()
Socket sink
result.map(x=>s"${x._1}_${x._2}\n").writeToSocket("slave1",6666,new SimpleStringSchema())
/*
Signature of writeToSocket, for reference:
def writeToSocket(
    hostname: String,
    port: Integer,
    // schema: a SerializationSchema with a single type parameter
    schema: SerializationSchema[T]): DataStreamSink[T] = {
  stream.writeToSocket(hostname, port, schema)
}
*/
File sink
result.writeAsText("src/main/resources/result.txt")
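writeAsText is a simple sink mainly useful for testing and is deprecated in newer Flink releases. A minimal sketch of the row-format streaming file sink as an alternative (assuming a Flink version that ships StreamingFileSink; the output directory is only illustrative):
import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
// Row-format file sink that writes part files under the given directory
val fileSink: StreamingFileSink[String] = StreamingFileSink
  .forRowFormat(new Path("src/main/resources/result_dir"), new SimpleStringEncoder[String]("UTF-8"))
  .build()
result.map(x => s"${x._1} ${x._2}").addSink(fileSink)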
Kafka-Sink
val inputData: DataStream[String] = result.map(x => s"${x._1} ${x._2}")
val kafkaSink = new FlinkKafkaProducer[String]("test1", new SimpleStringSchema(), props)
inputData.addSink(kafkaSink)
Custom JDBC Sink (MySQL)
jdbc_Source
  .map(x => user(x.id, x.name))
  .addSink(new jdbcSink)

// Custom sink class extending RichSinkFunction
class jdbcSink extends RichSinkFunction[user] {
  var conn: Connection = _
  var updateStatement: PreparedStatement = _
  var insertStatement: PreparedStatement = _

  // Open the JDBC connection and prepare both statements once
  override def open(parameters: Configuration): Unit = {
    conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test?characterEncoding=utf-8&serverTimezone=UTC&useSSL=false", "root", "123456")
    updateStatement = conn.prepareStatement("update user_copy set name = ? where id = ?")
    insertStatement = conn.prepareStatement("insert into user_copy values(?,?)")
  }

  // "Upsert" logic: try an update first; if no row was updated, insert a new one
  override def invoke(value: user, context: SinkFunction.Context): Unit = {
    updateStatement.setString(1, value.name)
    updateStatement.setInt(2, value.id)
    updateStatement.executeUpdate()
    if (updateStatement.getUpdateCount == 0) {
      insertStatement.setInt(1, value.id)
      insertStatement.setString(2, value.name)
      insertStatement.executeUpdate()
    }
  }

  // Release JDBC resources
  override def close(): Unit = {
    if (insertStatement != null) insertStatement.close()
    if (updateStatement != null) updateStatement.close()
    if (conn != null) conn.close()
  }
}