1 Environment
1.1 getExecutionEnvironment
Creates an execution environment that represents the context of the current program. If the program is invoked standalone, this method returns a local execution environment; if the program is invoked from a command-line client for submission to a cluster, this method returns that cluster's execution environment. In other words, getExecutionEnvironment decides which environment to return based on how the program is being run, which makes it the most commonly used way to create an execution environment.
// batch execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// streaming execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
If no parallelism is set explicitly, the value configured in flink-conf.yaml takes effect; the default is 1.
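The parallelism can also be set in code, which takes precedence over the config file; a minimal sketch (the value 4 is only an illustration):

// overrides the flink-conf.yaml setting for this job
env.setParallelism(4);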
1.2 Source
- Reading data from a collection
// Source: read data from a Java Collection
DataStream<SensorReading> dataStream = env.fromCollection(
        Arrays.asList(
                new SensorReading("sensor_1", 1547718199L, 35.8),
                new SensorReading("sensor_6", 1547718201L, 15.4),
                new SensorReading("sensor_7", 1547718202L, 6.7),
                new SensorReading("sensor_10", 1547718205L, 38.1)
        )
);
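The examples in this chapter assume a simple SensorReading POJO with an id, a timestamp, and a temperature; it is not shown in the original snippets, so here is a minimal sketch:

// minimal POJO assumed by the examples: id, timestamp, temperature
public class SensorReading {
    private String id;
    private Long timestamp;
    private Double temperature;

    // Flink POJOs need a public no-arg constructor
    public SensorReading() {}

    public SensorReading(String id, Long timestamp, Double temperature) {
        this.id = id;
        this.timestamp = timestamp;
        this.temperature = temperature;
    }

    public String getId() { return id; }
    public Long getTimestamp() { return timestamp; }
    public Double getTemperature() { return temperature; }

    public void setId(String id) { this.id = id; }
    public void setTimestamp(Long timestamp) { this.timestamp = timestamp; }
    public void setTemperature(Double temperature) { this.temperature = temperature; }
}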
- Reading data from a file
// read data from a text file
DataStream<String> dataStream = env.readTextFile("/src/main/resources/sensor.txt");
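This assumes a sensor.txt under the resources directory with comma-separated readings, for example:

sensor_1,1547718199,35.8
sensor_6,1547718201,15.4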
- Reading data from Kafka
1. pom dependencies
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>Flink_Tutorial</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <flink.version>1.12.1</flink.version>
        <scala.binary.version>2.12</scala.binary.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- kafka -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
    </dependencies>
</project>
2. Start ZooKeeper
$ bin/zookeeper-server-start.sh config/zookeeper.properties
3. Start the Kafka server
$ bin/kafka-server-start.sh config/server.properties
4. Start a Kafka console producer
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic sensor
5. Write the Java code
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class SourceTest3_Kafka {

    public static void main(String[] args) throws Exception {
        // create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // set parallelism to 1
        env.setParallelism(1);

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        // the following consumer settings are optional
        properties.setProperty("group.id", "consumer-group");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");

        // add Kafka as an external source
        DataStream<String> dataStream = env.addSource(new FlinkKafkaConsumer<String>("sensor", new SimpleStringSchema(), properties));

        // print the output
        dataStream.print();

        env.execute();
    }
}
6. Run the Java code, then type records into the Kafka producer console:
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic sensor
>sensor_1,1547718199,35.8
>sensor_6,1547718201,15.4
>
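With parallelism 1, the Flink job prints each consumed record as-is, so its console output should look roughly like:

sensor_1,1547718199,35.8
sensor_6,1547718201,15.4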
- Custom Source
Pass a custom SourceFunction to the addSource method:
DataStream<SensorReading> dataStream = env.addSource(new MySensorSource());
// implement a custom SourceFunction
public static class MySensorSource implements SourceFunction<SensorReading> {

    // flag bit that controls data generation
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<SensorReading> ctx) throws Exception {
        // define a random number generator
        Random random = new Random();

        // set initial temperatures for 10 sensors
        HashMap<String, Double> sensorTempMap = new HashMap<>();
        for (int i = 0; i < 10; ++i) {
            sensorTempMap.put("sensor_" + (i + 1), 60 + random.nextGaussian() * 20);
        }

        while (running) {
            for (String sensorId : sensorTempMap.keySet()) {
                // apply a random fluctuation on top of the current temperature
                Double newTemp = sensorTempMap.get(sensorId) + random.nextGaussian();
                sensorTempMap.put(sensorId, newTemp);
                ctx.collect(new SensorReading(sensorId, System.currentTimeMillis(), newTemp));
            }
            // control the emission rate
            Thread.sleep(2000L);
        }
    }

    @Override
    public void cancel() {
        this.running = false;
    }
}
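A minimal driver for this custom source (a sketch; the job name is arbitrary):

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    // prints one batch of simulated readings roughly every 2 seconds
    env.addSource(new MySensorSource()).print();
    env.execute("custom source");
}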
1.3 Transform
- Basic transformation operators (map/flatMap/filter)
// 1. map: String => length of the string as an Integer
DataStream<Integer> mapStream = dataStream.map(new MapFunction<String, Integer>() {
    @Override
    public Integer map(String value) throws Exception {
        return value.length();
    }
});
// 2. flatMap: split a string by commas
DataStream<String> flatMapStream = dataStream.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        String[] fields = value.split(",");
        for (String field : fields) {
            out.collect(field);
        }
    }
});
// 3. filter: keep only records starting with "sensor_1"
DataStream<String> filterStream = dataStream.filter(new FilterFunction<String>() {
    @Override
    public boolean filter(String value) throws Exception {
        return value.startsWith("sensor_1");
    }
});
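These three operators can also be written as Java lambdas. Note that when the lambda's output involves a generic type, Flink cannot infer it because of type erasure and needs an explicit hint via returns (Types here is org.apache.flink.api.common.typeinfo.Types). A sketch:

DataStream<Integer> mapStream = dataStream.map(String::length);

DataStream<String> flatMapStream = dataStream
        .flatMap((String value, Collector<String> out) -> {
            for (String field : value.split(",")) {
                out.collect(field);
            }
        })
        .returns(Types.STRING); // type hint required: the Collector's type parameter is erased

DataStream<String> filterStream = dataStream.filter(value -> value.startsWith("sensor_1"));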
- Aggregation operators
DataStream itself has no aggregation methods such as reduce or sum, because in Flink's design all data must be grouped before it can be aggregated.
First call keyBy to obtain a KeyedStream, then call its aggregation methods such as reduce and sum (group first, then aggregate).
The common aggregation operators are:
keyBy
rolling aggregation operators (Rolling Aggregation)
reduce
KeyBy
1. keyBy repartitions the stream;
2. Different keys may end up in the same partition, since partitioning is implemented via hashing.
Rolling Aggregation
These operators aggregate each keyed substream of a KeyedStream:
sum()
min()
max()
minBy()
maxBy()
// group first, then aggregate
// group by sensor id
KeyedStream<SensorReading, String> keyedStream = sensorStream.keyBy(SensorReading::getId);
// rolling aggregation; the difference between max and maxBy: maxBy also updates all other fields
// to those of the latest record, while max only updates the compared field and leaves the rest unchanged
DataStream<SensorReading> resultStream = keyedStream.maxBy("temperature");
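To make the difference concrete (sample values chosen for illustration): if sensor_1 emits (sensor_1, 1000, 35.8) and then (sensor_1, 2000, 37.2), then after the second record maxBy("temperature") outputs the complete latest record (sensor_1, 2000, 37.2), while max("temperature") outputs (sensor_1, 1000, 37.2): only the compared temperature field is updated, and the timestamp still comes from the first record.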
reduce
reduce fits more general aggregation scenarios. In Java you implement the ReduceFunction functional interface.
For example, building on the Rolling Aggregation case above, change the requirement: per group, obtain the sensor information with the highest temperature seen so far, while keeping the timestamp updated to the latest record.
// group first, then aggregate
// group by sensor id
KeyedStream<SensorReading, String> keyedStream = sensorStream.keyBy(SensorReading::getId);
// reduce: a custom reduce function keeps the max temperature but updates the timestamp to the latest
DataStream<SensorReading> resultStream = keyedStream.reduce(
        (curSensor, newSensor) -> new SensorReading(
                curSensor.getId(),
                newSensor.getTimestamp(),
                Math.max(curSensor.getTemperature(), newSensor.getTemperature()))
);
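The same reduction written against the ReduceFunction interface explicitly:

DataStream<SensorReading> resultStream = keyedStream.reduce(new ReduceFunction<SensorReading>() {
    @Override
    public SensorReading reduce(SensorReading curSensor, SensorReading newSensor) throws Exception {
        // keep the running maximum temperature, but always take the newest timestamp
        return new SensorReading(curSensor.getId(), newSensor.getTimestamp(),
                Math.max(curSensor.getTemperature(), newSensor.getTemperature()));
    }
});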
- Multi-stream transformation operators
Multi-stream transformation operators generally include:
Split and Select (removed in newer versions)
Connect and CoMap
Union
Split and Select: splitting
Note: newer versions of Flink no longer provide the Split and Select APIs (at least not in Flink 1.12.1!); side outputs via getSideOutput are used instead.
split & select: split one DataStream into multiple DataStreams.
The same functionality can be implemented with getSideOutput, as sketched below.
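The snippets below read from a stream named DataSensorReadingmap, which the original excerpt does not define; a sketch of how it might be built (the file path and parsing are assumptions):

// hypothetical construction of the input stream used below
DataStream<SensorReading> DataSensorReadingmap = env
        .readTextFile("/src/main/resources/sensor.txt")
        .map(new MapFunction<String, SensorReading>() {
            @Override
            public SensorReading map(String line) throws Exception {
                String[] fields = line.split(",");
                return new SensorReading(fields[0], Long.valueOf(fields[1]), Double.valueOf(fields[2]));
            }
        });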
First, define OutputTags for the categories to split into (the trailing {} creates an anonymous subclass so that Flink can capture the generic type):
// define the OutputTags for the getSideOutput categories
private static final org.apache.flink.util.OutputTag<SensorReading> high = new org.apache.flink.util.OutputTag<SensorReading>("high") {};
private static final org.apache.flink.util.OutputTag<SensorReading> low = new org.apache.flink.util.OutputTag<SensorReading>("low") {};
// split the stream: route each reading to a side output by temperature
SingleOutputStreamOperator<SensorReading> SplitSensorReading = DataSensorReadingmap.process(new ProcessFunction<SensorReading, SensorReading>() {
    @Override
    public void processElement(SensorReading sensorReading, Context context, Collector<SensorReading> collector) throws Exception {
        // classify by temperature; every record goes to exactly one side output
        if (sensorReading.getTemperature() > 30) {
            context.output(high, sensorReading);
        } else {
            context.output(low, sensorReading);
        }
    }
});
SplitSensorReading.getSideOutput(high).print("high");
SplitSensorReading.getSideOutput(low).print("low");
Connect and CoMap: connecting
Connect:
DataStream, DataStream -> ConnectedStreams:
Connects two data streams while preserving their types. After being connected, the two streams are merely placed inside one stream; internally each keeps its own data and form unchanged, and the two streams remain independent of each other.
CoMap:
ConnectedStreams -> DataStream:
Applied to ConnectedStreams; it works like map and flatMap, applying a separate map or flatMap operation to each stream inside the ConnectedStreams.
// connect: convert the high-temperature stream to a tuple type, connect it with the low-temperature stream, and emit status information
SingleOutputStreamOperator<Tuple2<String, Double>> HighTemperatureWarning = SplitSensorReading.getSideOutput(high).map(new MapFunction<SensorReading, Tuple2<String, Double>>() {
    @Override
    public Tuple2<String, Double> map(SensorReading sensorReading) throws Exception {
        return new Tuple2<>(sensorReading.getId(), sensorReading.getTemperature());
    }
});
// connect the high-temperature Tuple2 stream with the low-temperature SensorReading stream
ConnectedStreams<Tuple2<String, Double>, SensorReading> sensorReadingConnectedStreams = HighTemperatureWarning.connect(SplitSensorReading.getSideOutput(low));
// CoMap: apply a separate map to each of the two streams
SingleOutputStreamOperator<Object> ResultStream = sensorReadingConnectedStreams.map(new CoMapFunction<Tuple2<String, Double>, SensorReading, Object>() {
    @Override
    public Object map1(Tuple2<String, Double> stringDoubleTuple2) throws Exception {
        return new Tuple3<>(stringDoubleTuple2.f0, stringDoubleTuple2.f1, "high temp warning");
    }

    @Override
    public Object map2(SensorReading sensorReading) throws Exception {
        return new Tuple2<>(sensorReading.getId(), sensorReading.getTemperature());
    }
});
ResultStream.print();
Union: uniting (multiple streams)
DataStream -> DataStream:
Performs a union of two or more DataStreams, producing a new DataStream that contains all elements of the input DataStreams.
Differences between Union and Connect:
1. Connect can combine streams with different data types, but only two streams at a time;
2. Union can merge any number of streams, but they must all have the same data type.
// union: unite multiple streams
DataStream<SensorReading> UnionSensorReadingDataStream = SplitSensorReading.getSideOutput(high).union(SplitSensorReading.getSideOutput(low));
UnionSensorReadingDataStream.print("Union");
1.4 Summary
A Transformation operator converts one or more DataStreams into a new DataStream.
Through different Transformation operations, DataStreams are transformed, filtered, and aggregated into other streams, fulfilling our business requirements.