4. Flink Streaming API

1 Environment


1.1 getExecutionEnvironment

Creates an execution environment that represents the context of the current program. If the program is invoked standalone, this method returns a local execution environment; if the program is submitted to a cluster through the command-line client, it returns that cluster's execution environment. In other words, getExecutionEnvironment decides which environment to return based on how the program is being run, which makes it the most common way to create an execution environment.

// Batch execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Streaming execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

If no parallelism is set, the configuration in flink-conf.yaml is used; the default is 1.
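Besides getExecutionEnvironment, an environment can also be created explicitly. A minimal sketch; the host, port, and jar path passed to createRemoteEnvironment are placeholder assumptions:

// Local environment with an explicit parallelism
StreamExecutionEnvironment localEnv = StreamExecutionEnvironment.createLocalEnvironment(2);

// Remote environment; host, port and the jar to ship are hypothetical placeholders
StreamExecutionEnvironment remoteEnv = StreamExecutionEnvironment.createRemoteEnvironment(
        "jobmanager-host", 8081, "path/to/your-job.jar");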

1.2 Source

  1. Reading from a collection
// Source: read data from a Java Collection
        DataStream<SensorReading> dataStream = env.fromCollection(
                Arrays.asList(
                        new SensorReading("sensor_1", 1547718199L, 35.8),
                        new SensorReading("sensor_6", 1547718201L, 15.4),
                        new SensorReading("sensor_7", 1547718202L, 6.7),
                        new SensorReading("sensor_10", 1547718205L, 38.1)
                )
        );
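These examples assume a SensorReading POJO with an id, a timestamp, and a temperature. A minimal sketch of such a class, consistent with how it is constructed and queried throughout this section:

public class SensorReading {
    private String id;
    private Long timestamp;
    private Double temperature;

    // Flink POJOs need a public no-arg constructor
    public SensorReading() {
    }

    public SensorReading(String id, Long timestamp, Double temperature) {
        this.id = id;
        this.timestamp = timestamp;
        this.temperature = temperature;
    }

    public String getId() { return id; }
    public Long getTimestamp() { return timestamp; }
    public Double getTemperature() { return temperature; }

    public void setId(String id) { this.id = id; }
    public void setTimestamp(Long timestamp) { this.timestamp = timestamp; }
    public void setTemperature(Double temperature) { this.temperature = temperature; }
}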
  2. Reading from a file
// Read data from a text file
       DataStream<String> dataStream = env.readTextFile("/src/main/resources/sensor.txt");
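sensor.txt is assumed to hold one comma-separated reading per line (the same id,timestamp,temperature format typed into the Kafka producer below), so a map is typically chained on to parse each line; a sketch:

// Parse each "id,timestamp,temperature" line into a SensorReading
DataStream<SensorReading> sensorStream = dataStream.map(line -> {
    String[] fields = line.split(",");
    return new SensorReading(fields[0], Long.valueOf(fields[1]), Double.valueOf(fields[2]));
});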
  3. Reading from Kafka
    1 pom dependencies
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>Flink_Tutorial</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <flink.version>1.12.1</flink.version>
        <scala.binary.version>2.12</scala.binary.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <!-- kafka -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
    </dependencies>
</project>

2 Start ZooKeeper

$ bin/zookeeper-server-start.sh config/zookeeper.properties

3 Start the Kafka server

$ bin/kafka-server-start.sh config/server.properties

4 Start a Kafka console producer

$ bin/kafka-console-producer.sh --broker-list localhost:9092  --topic sensor

5 Write the Java code

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class SourceTest3_Kafka {

    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Set parallelism to 1
        env.setParallelism(1);

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        // The following consumer settings are optional
        properties.setProperty("group.id", "consumer-group");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");

        // Add the Kafka consumer as an external source
        DataStream<String> dataStream = env.addSource(new FlinkKafkaConsumer<String>("sensor", new SimpleStringSchema(), properties));

        // Print the stream
        dataStream.print();

        env.execute();
    }
}

6 Run the Java code and type records into the Kafka producer console

$ bin/kafka-console-producer.sh --broker-list localhost:9092  --topic sensor
>sensor_1,1547718199,35.8
>sensor_6,1547718201,15.4
>
  4. Custom source
    Use the addSource method:
 DataStream<SensorReading> dataStream = env.addSource(new MySensorSource());
 // Custom SourceFunction implementation
public static class MySensorSource implements SourceFunction<SensorReading> {

        // Flag that controls data generation
        private volatile boolean running = true;


        @Override
        public void run(SourceContext<SensorReading> ctx) throws Exception {
            // Random number generator
            Random random = new Random();

            // Set initial temperatures for 10 sensors
            HashMap<String, Double> sensorTempMap = new HashMap<>();
            for (int i = 0; i < 10; ++i) {
                sensorTempMap.put("sensor_" + (i + 1), 60 + random.nextGaussian() * 20);
            }

            while (running) {
                for (String sensorId : sensorTempMap.keySet()) {
                    // Random fluctuation around the current temperature
                    Double newTemp = sensorTempMap.get(sensorId) + random.nextGaussian();
                    sensorTempMap.put(sensorId, newTemp);
                    ctx.collect(new SensorReading(sensorId,System.currentTimeMillis(),newTemp));
                }
                // Control the emission rate
                Thread.sleep(2000L);
            }
        }

        @Override
        public void cancel() {
            this.running = false;
        }
    }

1.3 Transform

  1. Basic transformation operators (map/flatMap/filter)
 		// 1. map: String => string length (Integer)
        DataStream<Integer> mapStream = dataStream.map(new MapFunction<String, Integer>() {
            @Override
            public Integer map(String value) throws Exception {
                return value.length();
            }
        });

        // 2. flatMap: split the string on commas
        DataStream<String> flatMapStream = dataStream.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                String[] fields = value.split(",");
                for(String field:fields){
                    out.collect(field);
                }
            }
        });

        // 3. filter: keep only records starting with "sensor_1"
        DataStream<String> filterStream = dataStream.filter(new FilterFunction<String>() {
            @Override
            public boolean filter(String value) throws Exception {
                return value.startsWith("sensor_1");
            }
        });
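These operators can also be written as lambdas. One caveat: for flatMap the Collector's generic type is erased at compile time, so the result type must be declared explicitly with returns() (Types here is org.apache.flink.api.common.typeinfo.Types). A sketch against the same dataStream of String:

// map and filter can infer their result types from the lambda
DataStream<Integer> mapStream = dataStream.map(String::length);
DataStream<String> filterStream = dataStream.filter(value -> value.startsWith("sensor_1"));

// flatMap needs an explicit result type because of Java type erasure
DataStream<String> flatMapStream = dataStream
        .flatMap((String value, Collector<String> out) -> {
            for (String field : value.split(",")) {
                out.collect(field);
            }
        })
        .returns(Types.STRING);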

  2. Aggregation operators
    DataStream itself has no aggregation methods such as reduce or sum, because in Flink's design data must be grouped before it can be aggregated.
    First call keyBy to get a KeyedStream, then call its aggregation methods such as reduce and sum (group first, then aggregate).

The common aggregation operators are:

keyBy

Rolling aggregation operators

reduce

KeyBy

1. keyBy repartitions the stream;
2. Different keys may end up in the same partition, because partitioning is implemented by hashing the key.

Rolling Aggregation
These operators aggregate each keyed sub-stream of a KeyedStream.

sum()
min()
max()
minBy()
maxBy()

		// Group first, then aggregate
        // Group by sensor id
        KeyedStream<SensorReading, String> keyedStream = sensorStream.keyBy(SensorReading::getId);

        // Rolling aggregation. max vs maxBy: maxBy emits the whole record that holds the maximum, so its other fields are updated too; max only updates the compared field and leaves the other fields unchanged
        DataStream<SensorReading> resultStream = keyedStream.maxBy("temperature");
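To make the max/maxBy difference concrete, a sketch on the same keyedStream:

// max: only the temperature field tracks the maximum; id and timestamp stay
// as they were in the earlier records of the key
DataStream<SensorReading> maxTemp = keyedStream.max("temperature");

// maxBy: the whole record holding the maximum temperature is emitted,
// so its timestamp and other fields come from that record
DataStream<SensorReading> maxByTemp = keyedStream.maxBy("temperature");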

reduce
Reduce covers more general aggregation scenarios. In Java you implement the ReduceFunction functional interface.

For example, building on the rolling aggregation above, change the requirement: for each group, keep the sensor information with the highest temperature so far, but require its timestamp to be updated to the latest one.

 		// Group first, then aggregate
        // Group by sensor id
        KeyedStream<SensorReading, String> keyedStream = sensorStream.keyBy(SensorReading::getId);

        // reduce: custom reduce function; keep the max-temperature sensor information, but update the timestamp to the latest one
        DataStream<SensorReading> resultStream = keyedStream.reduce(
                (curSensor,newSensor)->new SensorReading(curSensor.getId(),newSensor.getTimestamp(), Math.max(curSensor.getTemperature(), newSensor.getTemperature()))
        );
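Since ReduceFunction is a functional interface, the lambda above is equivalent to this anonymous class, shown here only to make the interface explicit:

DataStream<SensorReading> resultStream = keyedStream.reduce(new ReduceFunction<SensorReading>() {
    @Override
    public SensorReading reduce(SensorReading curSensor, SensorReading newSensor) throws Exception {
        // Keep the max temperature, but always take the newest timestamp
        return new SensorReading(
                curSensor.getId(),
                newSensor.getTimestamp(),
                Math.max(curSensor.getTemperature(), newSensor.getTemperature()));
    }
});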

3. Multi-stream operators
The multi-stream operators are:

Split and Select (removed in recent versions)

Connect and CoMap

Union

Split and Select: splitting a stream

Note: recent Flink versions no longer have the Split and Select APIs (at least Flink 1.12.1 does not!); use side outputs via getSideOutput instead.

split & select: split one DataStream into several DataStreams.

To implement this with getSideOutput, first define the OutputTags for the categories. They are created as anonymous subclasses (note the trailing {}) so that Flink can capture the generic type parameter:

	// Define the OutputTags used by getSideOutput
    private static final org.apache.flink.util.OutputTag<SensorReading> high = new org.apache.flink.util.OutputTag<SensorReading>("high"){
    };
    private static final org.apache.flink.util.OutputTag<SensorReading> low = new org.apache.flink.util.OutputTag<SensorReading>("low"){
    };
	// Split the stream (DataSensorReadingmap is the upstream DataStream<SensorReading>)
	SingleOutputStreamOperator<SensorReading> SplitSensorReading = DataSensorReadingmap.process(new ProcessFunction<SensorReading, SensorReading>() {
            @Override
            public void processElement(SensorReading sensorReading, Context context, Collector<SensorReading> collector) throws Exception {
                // Route by temperature: above 30 goes to the "high" side output, the rest to "low"
                if (sensorReading.getTemperature() > 30) {
                    context.output(high, sensorReading);
                } else {
                    context.output(low, sensorReading);
                }
            }
        });

        SplitSensorReading.getSideOutput(high).print("high");
        SplitSensorReading.getSideOutput(low).print("low");

Connect and CoMap: connecting streams
Connect:

DataStream, DataStream -> ConnectedStreams:
Connects two data streams while keeping their respective types. After two streams are connected they are merely placed in one container; internally each keeps its own data and form unchanged, and the two streams remain independent of each other.

CoMap:

ConnectedStreams -> DataStream:
Works on ConnectedStreams, serving the same purpose as map and flatMap: it applies a separate map/flatMap to each of the streams inside the ConnectedStreams;

        // connect: convert the high-temperature stream to tuples, connect it with the low-temperature stream, and output status information
        SingleOutputStreamOperator<Tuple2<String,Double>> HighTemperatureWarning = SplitSensorReading.getSideOutput(high).map(new MapFunction<SensorReading, Tuple2<String,Double>>() {
            @Override
            public Tuple2<String, Double> map(SensorReading sensorReading) throws Exception {
                return new Tuple2<>(sensorReading.getId(),sensorReading.getTemperature());
            }
        });

        // Connect the tuple-typed high-temperature stream with the SensorReading-typed low-temperature stream
        ConnectedStreams<Tuple2<String, Double>, SensorReading> sensorReadingConnectedStreams = HighTemperatureWarning.connect(SplitSensorReading.getSideOutput(low));

        // CoMap: apply a separate map to each of the two connected streams
        SingleOutputStreamOperator<Object> ResultStream = sensorReadingConnectedStreams.map(new CoMapFunction<Tuple2<String, Double>, SensorReading, Object>() {
            @Override
            public Object map1(Tuple2<String, Double> stringDoubleTuple2) throws Exception {
                return new Tuple3<>(stringDoubleTuple2.f0, stringDoubleTuple2.f1, "high temp warning");
            }

            @Override
            public Object map2(SensorReading sensorReading) throws Exception {
                return new Tuple2<>(sensorReading.getId(), sensorReading.getTemperature());
            }
        });
        ResultStream.print();

Union: merging (multiple) streams

DataStream -> DataStream:
Union of two or more DataStreams, producing a new DataStream that contains all elements of the input streams.

Differences between Union and Connect:
	 1. Connect can connect streams of different types, but only two at a time;
	 2. Union can merge any number of streams, but they must all have the same type;
		// union: merge multiple streams
        DataStream<SensorReading> UnionSensorReadingDataStream = SplitSensorReading.getSideOutput(high).union(SplitSensorReading.getSideOutput(low));
        UnionSensorReadingDataStream.print("Union");
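Because union is variadic, merging more than two streams is just a matter of passing more arguments. A sketch, where mid is a hypothetical third DataStream<SensorReading>:

// union accepts any number of streams of the same type
DataStream<SensorReading> allStream = SplitSensorReading.getSideOutput(high)
        .union(SplitSensorReading.getSideOutput(low), mid); // mid: hypothetical third stream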

1.4 Summary

A Transformation operator turns one or more DataStreams into a new DataStream.
Through different Transformation operations, DataStreams are converted, filtered and aggregated into other streams, and chaining these operations together is how we implement the business requirements.
