Flink countWindow: computing each student's total

Article link: https://blog.csdn.net/ASN_forever/article/details/107037840

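Requirement

Suppose the school's finance system is adding a new feature, similar to an annual statement: for each student, total up the amount deposited onto their campus card over the past year.

This kind of requirement does not need windows at all. Batch processing handles it directly: groupBy() followed by reduce().

Of course, the aggregation can also be done with stream processing and windows. Both approaches are covered below.

Batch processing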

public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        List<Deposit> list = new ArrayList<>();
        list.add(new Deposit(1,100));
        list.add(new Deposit(2,100));
        list.add(new Deposit(3,100));
        list.add(new Deposit(1,50));
        list.add(new Deposit(2,60));
        list.add(new Deposit(1,60));
        list.add(new Deposit(1,50));

        DataSource<Deposit> source = env.fromCollection(list);
        // These two approaches only support Tuple-typed data
        //source.aggregate(Aggregations.SUM,1);
        //source.sum(1);
        ReduceOperator<Deposit> reduceMoney = source.groupBy("studentID")
                .reduce(new ReduceFunction<Deposit>() {
                    @Override
                    public Deposit reduce(Deposit value1, Deposit value2) throws Exception {
                        value1.setMoney(value1.getMoney() + value2.getMoney());
                        return value1;
                    }
                });
        reduceMoney.print();

    }
    public static class Deposit{
        private int studentID;
        private float money;
        private String dateTime;

        public Deposit() {
        }

        public Deposit(final int studentID, final float money) {
            this.studentID = studentID;
            this.money = money;
        }

        public Deposit(final int studentID, final float money, final String dateTime) {
            this.studentID = studentID;
            this.money = money;
            this.dateTime = dateTime;
        }

        public int getStudentID() {
            return this.studentID;
        }

        public void setStudentID(final int studentID) {
            this.studentID = studentID;
        }

        public float getMoney() {
            return this.money;
        }

        public void setMoney(final float money) {
            this.money = money;
        }

        public String getDateTime() {
            return this.dateTime;
        }

        public void setDateTime(final String dateTime) {
            this.dateTime = dateTime;
        }

        @Override
        public String toString() {
            return "Deposit{" +
                    "studentID=" + studentID +
                    ", money=" + money +
                    '}';
        }
    }

Result:

Deposit{studentID=1, money=260.0}
Deposit{studentID=2, money=160.0}
Deposit{studentID=3, money=100.0}

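Summary:

In batch processing, grouping is done with groupBy(). It has three overloads, taking a KeySelector, int, or String; the latter two accept multiple values. groupBy(int... fields) only supports Tuple-typed data sets.
There are roughly three ways to accumulate in batch processing: sum(), reduce(), and aggregate(). sum() is syntactic sugar for aggregate(SUM, field), and both sum() and aggregate() only support Tuple-typed data.

Stream processing without windows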

public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        List<BatchReduce.Deposit> list = new ArrayList<>();
        list.add(new BatchReduce.Deposit(1,100));
        list.add(new BatchReduce.Deposit(2,100));
        list.add(new BatchReduce.Deposit(3,100));
        list.add(new BatchReduce.Deposit(1,50));
        list.add(new BatchReduce.Deposit(2,60));
        list.add(new BatchReduce.Deposit(1,60));
        list.add(new BatchReduce.Deposit(1,50));

        DataStreamSource<BatchReduce.Deposit> source = env.fromCollection(list);

        SingleOutputStreamOperator<BatchReduce.Deposit> sum = source.keyBy("studentID")
                .sum("money");
        sum.print();

        env.execute("stream reduce job");
    }

Result:

6> Deposit{studentID=2, money=100.0}
6> Deposit{studentID=3, money=100.0}
5> Deposit{studentID=1, money=100.0}
5> Deposit{studentID=1, money=150.0}
5> Deposit{studentID=1, money=210.0}
5> Deposit{studentID=1, money=260.0}
6> Deposit{studentID=2, money=160.0}

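Summary:

In stream processing, grouping is done with keyBy().
Accumulation can use sum() or reduce(); here sum() also supports POJOs.
Because the data is a stream, without a window every incoming record triggers one computation and one print.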
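
To see why the unwindowed job prints one line per input record, the per-key running sum can be modelled in plain Java. This is only a sketch of the semantics, not Flink code: the class RunningSumDemo and the method runningSums are hypothetical names, amounts are simplified to int, and records are processed in list order, so the interleaving differs from the parallel Flink output above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of keyBy("studentID").sum("money") on an unwindowed stream.
// Hypothetical names; this does not use the Flink API.
final class RunningSumDemo {

    // Each input row is {studentID, money}; one output line is produced per input record.
    static List<String> runningSums(int[][] deposits) {
        Map<Integer, Integer> totals = new HashMap<>();
        List<String> emitted = new ArrayList<>();
        for (int[] d : deposits) {
            // merge() adds the new amount to the key's total and returns the updated total
            int total = totals.merge(d[0], d[1], Integer::sum);
            emitted.add("studentID=" + d[0] + ", money=" + total);
        }
        return emitted;
    }

    public static void main(String[] args) {
        int[][] deposits = {{1, 100}, {2, 100}, {3, 100}, {1, 50}, {2, 60}, {1, 60}, {1, 50}};
        runningSums(deposits).forEach(System.out::println);
    }
}
```

Seven records in, seven lines out: every record updates its key's total and immediately emits the new value, which is why intermediate totals such as 150 and 210 show up for student 1.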

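Stream processing with windows

Windows can be classified in several ways. Depending on whether the stream is keyed, there are two kinds: window and windowAll. Within each kind, depending on the windowing condition, there are time-based windows (timeWindow/timeWindowAll) and count-based windows (countWindow/countWindowAll).

windowAll windows always run with a parallelism of 1.

Because the data has to be grouped by studentID here, countWindow is used.

Windows also offer several aggregation functions. For the usage of each, see the official documentation or the source code:

sum()
aggregate()
reduce()
process()

reduce() and aggregate() take a user-defined ReduceFunction and AggregateFunction respectively. Both aggregate incrementally as elements arrive, so they are more efficient than ProcessWindowFunction, which buffers every element of the window before computing. Another post covers this as well.

sum() also goes through an internal aggregate() call (in the Flink source it wraps a SumAggregator, which is itself a ReduceFunction), so it aggregates incrementally too.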

public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        List<BatchReduce.Deposit> list = new ArrayList<>();
        list.add(new BatchReduce.Deposit(1,100));
        list.add(new BatchReduce.Deposit(2,100));
        list.add(new BatchReduce.Deposit(3,100));
        list.add(new BatchReduce.Deposit(1,50));
        list.add(new BatchReduce.Deposit(2,60));
        list.add(new BatchReduce.Deposit(1,60));
        list.add(new BatchReduce.Deposit(1,50));

        DataStreamSource<BatchReduce.Deposit> source = env.fromCollection(list);

        SingleOutputStreamOperator<BatchReduce.Deposit> sum = source.keyBy("studentID")
                .countWindow(2)
                .sum("money");

        sum.print();

        env.execute("stream reduce job");
    }

Result:

6> Deposit{studentID=2, money=160.0}
5> Deposit{studentID=1, money=150.0}
5> Deposit{studentID=1, money=110.0}

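As the output shows, with the window size set to 2, the student with id 3 has only one record, so the trigger condition is never met and that record is "lost".

The student with id 1 has four records in total, so two windows fire. But each one returns only its own result; the results are not accumulated across windows, so this does not meet the requirement either.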
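
The result above can be reproduced without Flink by modelling what countWindow(2) does per key: buffer elements, emit the sum once the buffer reaches the window size, then discard the buffer. The sketch below is a plain-Java approximation under that assumption; CountWindowDemo, add and pending are hypothetical names, and amounts are simplified to int.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of keyBy(studentID).countWindow(size).sum("money").
// Hypothetical names; this does not use the Flink API.
final class CountWindowDemo {
    // One buffer of not-yet-fired elements per key
    private final Map<Integer, List<Integer>> buffers = new LinkedHashMap<>();
    private final int size;

    CountWindowDemo(int size) {
        this.size = size;
    }

    // Adds one record; returns the window sum if the window fired, otherwise null.
    Integer add(int studentId, int money) {
        List<Integer> buf = buffers.computeIfAbsent(studentId, k -> new ArrayList<>());
        buf.add(money);
        if (buf.size() < size) {
            return null; // not enough elements yet, nothing is emitted
        }
        int sum = 0;
        for (int m : buf) {
            sum += m;
        }
        buf.clear(); // purge: the next window for this key starts empty
        return sum;
    }

    // Elements still waiting in the buffer for a key (never emitted unless the window fills up).
    List<Integer> pending(int studentId) {
        return buffers.getOrDefault(studentId, new ArrayList<>());
    }

    public static void main(String[] args) {
        int[][] deposits = {{1, 100}, {2, 100}, {3, 100}, {1, 50}, {2, 60}, {1, 60}, {1, 50}};
        CountWindowDemo window = new CountWindowDemo(2);
        for (int[] d : deposits) {
            Integer fired = window.add(d[0], d[1]);
            if (fired != null) {
                System.out.println("studentID=" + d[0] + ", money=" + fired);
            }
        }
        System.out.println("still buffered for student 3: " + window.pending(3));
    }
}
```

Feeding the seven sample deposits in order fires 150 and 110 for student 1 and 160 for student 2, while the single deposit of student 3 stays in pending(3) and is never emitted.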

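This can be confirmed from the countWindow source, where the default trigger is based on the number of elements in the window. In Flink 1.x the method body is essentially window(GlobalWindows.create()).trigger(PurgingTrigger.of(CountTrigger.of(size))).

So countWindow creates a global window (GlobalWindows) and sets a PurgingTrigger on it. A global window must be given a trigger explicitly, because its default trigger never fires.

The PurgingTrigger source shows that the class acts as a kind of adapter: it wraps any trigger passed to it and turns it into a purging trigger, so whenever the wrapped trigger fires, it returns FIRE_AND_PURGE (run the computation, then clear the elements in the window).

Next, look at the CountTrigger source to see how a trigger is defined: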

/**
 * A {@link Trigger} that fires once the count of elements in a pane reaches the given count.
 *
 * @param <W> The type of {@link Window Windows} on which this trigger can operate.
 */
@PublicEvolving
public class CountTrigger<W extends Window> extends Trigger<Object, W> {
	private static final long serialVersionUID = 1L;

	private final long maxCount;

	private final ReducingStateDescriptor<Long> stateDesc =
			new ReducingStateDescriptor<>("count", new Sum(), LongSerializer.INSTANCE);

	private CountTrigger(long maxCount) {
		this.maxCount = maxCount;
	}

	@Override
	public TriggerResult onElement(Object element, long timestamp, W window, TriggerContext ctx) throws Exception {
		ReducingState<Long> count = ctx.getPartitionedState(stateDesc);
		count.add(1L);
		if (count.get() >= maxCount) {
			count.clear();
			return TriggerResult.FIRE;
		}
		return TriggerResult.CONTINUE;
	}

	@Override
	public TriggerResult onEventTime(long time, W window, TriggerContext ctx) {
		return TriggerResult.CONTINUE;
	}

	@Override
	public TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) throws Exception {
		return TriggerResult.CONTINUE;
	}

	@Override
	public void clear(W window, TriggerContext ctx) throws Exception {
		ctx.getPartitionedState(stateDesc).clear();
	}

	@Override
	public boolean canMerge() {
		return true;
	}

	@Override
	public void onMerge(W window, OnMergeContext ctx) throws Exception {
		ctx.mergePartitionedState(stateDesc);
	}

	@Override
	public String toString() {
		return "CountTrigger(" +  maxCount + ")";
	}

	/**
	 * Creates a trigger that fires once the number of elements in a pane reaches the given count.
	 *
	 * @param maxCount The count of elements at which to fire.
	 * @param <W> The type of {@link Window Windows} on which this trigger can operate.
	 */
	public static <W extends Window> CountTrigger<W> of(long maxCount) {
		return new CountTrigger<>(maxCount);
	}

	private static class Sum implements ReduceFunction<Long> {
		private static final long serialVersionUID = 1L;

		@Override
		public Long reduce(Long value1, Long value2) throws Exception {
			return value1 + value2;
		}

	}
}

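The core of the class is the onElement() method. It uses a ReducingState (declared through a ReducingStateDescriptor) to count the elements in the window; once the count reaches the configured window size, it clears the state and triggers the window function.

Both onEventTime() and onProcessingTime() return TriggerResult.CONTINUE, meaning they never fire.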
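
How the two triggers compose can be illustrated with a small stand-alone model. The sketch below is not the Flink API: TriggerModel, SimpleTrigger, Counting and Purging are hypothetical stand-ins for Trigger, CountTrigger and PurgingTrigger, reduced to element counting only (no timers, no state backend).

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone model of PurgingTrigger.of(CountTrigger.of(n)).
// Hypothetical, simplified types; this does not use the Flink API.
final class TriggerModel {
    enum Result { CONTINUE, FIRE, FIRE_AND_PURGE }

    interface SimpleTrigger {
        Result onElement();
    }

    // Mirrors CountTrigger.onElement(): count up, fire and reset when maxCount is reached.
    static final class Counting implements SimpleTrigger {
        private final long maxCount;
        private long count;

        Counting(long maxCount) {
            this.maxCount = maxCount;
        }

        @Override
        public Result onElement() {
            count++;
            if (count >= maxCount) {
                count = 0;
                return Result.FIRE;
            }
            return Result.CONTINUE;
        }
    }

    // Mirrors the adapter role of PurgingTrigger: upgrade FIRE to FIRE_AND_PURGE.
    static final class Purging implements SimpleTrigger {
        private final SimpleTrigger nested;

        Purging(SimpleTrigger nested) {
            this.nested = nested;
        }

        @Override
        public Result onElement() {
            Result r = nested.onElement();
            return r == Result.FIRE ? Result.FIRE_AND_PURGE : r;
        }
    }

    // Sends n elements through the trigger and records each decision.
    static List<Result> feed(SimpleTrigger trigger, int n) {
        List<Result> results = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            results.add(trigger.onElement());
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(feed(new Purging(new Counting(2)), 5));
    }
}
```

With a count of 2, every second element yields FIRE_AND_PURGE and a trailing odd element yields only CONTINUE, which matches the case of student 3, whose single record never fires.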

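Summary:

By default, countWindow can only aggregate within the current window instance (per-window); it cannot produce a final aggregate over all windows of the current key. To solve this, define state in a ProcessWindowFunction and share it across window instances.
The window assigner underlying countWindow is GlobalWindows, combined with a count-based trigger.
By default, a countWindow triggers the window function (FIRE_AND_PURGE) only when the number of elements reaches the window size, so if a window holds too few elements, they never trigger the window function.
To solve this, a custom trigger is needed, one that fires when either a count condition or a time condition is met.

Let's start with the first problem: data from different per-window instances cannot be combined.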

public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        List<BatchReduce.Deposit> list = new ArrayList<>();
        list.add(new BatchReduce.Deposit(1,100));
        list.add(new BatchReduce.Deposit(2,100));
        list.add(new BatchReduce.Deposit(3,100));
        list.add(new BatchReduce.Deposit(1,50));
        list.add(new BatchReduce.Deposit(2,60));
        list.add(new BatchReduce.Deposit(1,60));
        list.add(new BatchReduce.Deposit(1,50));

        DataStreamSource<BatchReduce.Deposit> source = env.fromCollection(list);

        /*SingleOutputStreamOperator<BatchReduce.Deposit> sum = source.keyBy("studentID")
                .countWindow(2)
                .sum("money");*/
        SingleOutputStreamOperator<BatchReduce.Deposit> sum = source.keyBy(new KeySelector<BatchReduce.Deposit, Integer>() {
            @Override
            public Integer getKey(BatchReduce.Deposit value) throws Exception {
                return value.getStudentID();
            }
        })
                .countWindow(2)
                .process(new ProcessWindowFunction<BatchReduce.Deposit, BatchReduce.Deposit, Integer, GlobalWindow>() {
                    private ValueState<Tuple2<Integer, Float>> valueState;
                    @Override
                    public void open(Configuration parameters){
                        // Create the ValueStateDescriptor (typed, to avoid a raw-type warning)
                        ValueStateDescriptor<Tuple2<Integer, Float>> descriptor = new ValueStateDescriptor<>("depositSumStateDesc",
                                TypeInformation.of(new TypeHint<Tuple2<Integer, Float>>() {}));

                        // Obtain the ValueState from the descriptor; it is scoped to the current key (studentID), not to a single window
                        valueState = getRuntimeContext().getState(descriptor);
                    }
                    @Override
                    public void process(Integer key, Context context, Iterable<BatchReduce.Deposit> elements, Collector<BatchReduce.Deposit> out) throws Exception {
                        Tuple2<Integer, Float> currentState = valueState.value();
                        // Initialize the state the first time a window fires for this key
                        if (null == currentState) {
                            currentState = new Tuple2<>(key, 0f);
                        }
                        // Sum the deposits in the current window
                        float sum = 0f;
                        for (BatchReduce.Deposit deposit:elements){
                            sum += deposit.getMoney();
                        }
                        // Add the window's sum to the running total for this key
                        currentState.f1 = currentState.f1 + sum;

                        // Write the updated total back to the ValueState
                        valueState.update(currentState);
                        BatchReduce.Deposit deposit = new BatchReduce.Deposit();
                        deposit.setStudentID(currentState.f0);
                        deposit.setMoney(currentState.f1);
                        out.collect(deposit);
                    }
                });

        sum.print();

        env.execute("stream reduce job");
    }

Result:

6> Deposit{studentID=2, money=160.0}
5> Deposit{studentID=1, money=150.0}
5> Deposit{studentID=1, money=260.0}

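As the output shows, for the student with id 1, the total emitted by the second window now includes the total from the first window.

For the student with id 3, however, the default trigger's condition is never met, so nothing is ever emitted.

The next step is to solve this with a custom trigger, so that the window function fires when either the element count or a timeout is reached.

(Not yet finished...)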
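
As a rough picture of the "count or timeout" rule described above, the firing logic can be modelled in plain Java for a single key. This is only a sketch under assumptions: CountOrTimeoutDemo, onElement and onTime are hypothetical names, time is passed in explicitly as milliseconds, and amounts are simplified to int. In Flink the same idea would presumably be expressed as a custom Trigger that registers a processing-time timer; that implementation is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of a "fire on count OR on timeout" rule for ONE key.
// Hypothetical names throughout; this is not the Flink Trigger API.
final class CountOrTimeoutDemo {
    private final int maxCount;
    private final long timeoutMillis;
    private final List<Integer> buffer = new ArrayList<>();
    private long deadline = -1L; // -1 means no timer is pending

    CountOrTimeoutDemo(int maxCount, long timeoutMillis) {
        this.maxCount = maxCount;
        this.timeoutMillis = timeoutMillis;
    }

    // Called for every record; returns the window sum if the count rule fired, otherwise null.
    Integer onElement(int money, long nowMillis) {
        if (buffer.isEmpty()) {
            deadline = nowMillis + timeoutMillis; // the first element of a window starts the timer
        }
        buffer.add(money);
        return buffer.size() >= maxCount ? fireAndPurge() : null;
    }

    // Called when time advances; returns the partial sum if the timeout rule fired, otherwise null.
    Integer onTime(long nowMillis) {
        boolean expired = deadline >= 0 && nowMillis >= deadline;
        return expired && !buffer.isEmpty() ? fireAndPurge() : null;
    }

    // Emits the sum of the buffered elements, then clears the buffer and cancels the timer.
    private Integer fireAndPurge() {
        int sum = 0;
        for (int m : buffer) {
            sum += m;
        }
        buffer.clear();
        deadline = -1L;
        return sum;
    }

    public static void main(String[] args) {
        CountOrTimeoutDemo student3 = new CountOrTimeoutDemo(2, 1000L);
        System.out.println(student3.onElement(100, 0L)); // window not full yet
        System.out.println(student3.onTime(1000L));      // timeout flushes the lone record
    }
}
```

With this rule, a lone record such as the one for student 3 is emitted once the timeout passes instead of waiting forever for a second element. One caveat to keep in mind: with a bounded source such as fromCollection, the job may finish before a processing-time timer ever fires.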