Contents
FLUME
Flume Principles
A standalone Flume process is called an agent. Each agent contains three components: source, channel, and sink.
- source: collects data and interfaces with the data source; it is where the data stream originates, and it passes the collected data on to the channel. Common sources include: netcat, exec, http, avro, spooldir, kafka, and custom sources.
- channel: connects the source and the sink; it behaves like a queue (first in, first out) and also buffers the data. Common channels include: memory channel, file channel…
- sink: pulls data from the channel and writes it to the destination. Common sinks include: hdfs, logger, kafka, hive, avro, and custom sinks.
For more background, see this article:
https://blog.csdn.net/ddzzz_/article/details/114712952
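To make the three roles concrete, here is a minimal agent configuration sketch (the names a1/r1/c1/k1 are arbitrary; it uses a netcat source and a logger sink purely for illustration, unlike the exec-to-Kafka setup used later in this document):
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# netcat source: listens on a TCP port and turns each line into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# in-memory channel buffering events between source and sink
a1.channels.c1.type = memory
# logger sink: writes events to the Flume log, handy for smoke tests
a1.sinks.k1.type = logger
# wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1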
Installation and Configuration
# Pull the flume image
docker pull probablyfine/flume:2.0.0
# Create a flume directory containing conf, logs and other subdirectories
mkdir /data/Lake/flume
cd /data/Lake/flume
mkdir conf logs flume_log test_logs
chmod -R 777 /data/Lake/flume
cd conf
vi los-flume-kakfa.conf   # contents are given in the next section
docker run -itd --name flume --restart always \
  -v /data/Lake/flume/conf:/opt/flume-config/flume.conf \
  -v /data/Lake/flume/flume_log:/var/tmp/flume_log \
  -v /data/Lake/flume/logs:/opt/flume/logs \
  -v /tmp/test_logs/:/tmp/test_logs/ \
  -e FLUME_AGENT_NAME="agent" probablyfine/flume:2.0.0
docker exec -it flume /bin/bash
nohup /opt/flume/bin/flume-ng agent -c /opt/flume/conf -f /opt/flume-config/flume.conf/los-flume-kakfa.conf -n a1 &
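# Optional sanity check (plain shell, not Flume-specific; assumes ps is available in the image):
ps -ef | grep flume-ng    # the agent process should be running
tail -n 50 nohup.out      # nohup writes the agent's startup output to nohup.out in the current directory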
A complete sources/channels/sinks configuration example
los-flume-kakfa.conf
# vi los-flume-kakfa.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /tmp/test_logs/app.log  # the source file from which Flume reads the log
# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink  # the sink type: data is written to Kafka
a1.sinks.k1.topic = test  # the topic name matters; it is needed when producing to and consuming from Kafka
a1.sinks.k1.brokerList = 172.17.0.1:9092  # Kafka broker address; for a cluster, separate multiple addresses with commas
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.batchSize = 20
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
ZOOKEEPER/KAFKA
Installation and Configuration
docker pull wurstmeister/zookeeper
docker run -d --restart=always --log-driver json-file --log-opt max-size=100m \
  --log-opt max-file=2 --name zookeeper -p 2181:2181 \
  -v /etc/localtime:/etc/localtime wurstmeister/zookeeper
docker pull wurstmeister/kafka:2.11-0.11.0.3 # or version 2.12-2.3.0
docker run -d --restart=always --log-driver json-file --log-opt max-size=100m \
  --log-opt max-file=2 --name kafka -p 9092:9092 \
  -e KAFKA_BROKER_ID=0 \
  -e KAFKA_ZOOKEEPER_CONNECT=172.17.0.1:2181/kafka \
  -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://172.17.0.1:9092 \
  -e KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:9092 \
  -v /etc/localtime:/etc/localtime \
  wurstmeister/kafka:2.11-0.11.0.3
# Parameter notes:
# -e KAFKA_BROKER_ID=0    in a Kafka cluster, each broker has a BROKER_ID to identify itself
# -e KAFKA_ZOOKEEPER_CONNECT=172.17.0.1:2181/kafka    the ZooKeeper path under which Kafka is managed
# -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://172.17.0.1:9092    registers Kafka's address and port with ZooKeeper; for remote access change this to the external IP, e.g. if a Java client cannot connect
# -e KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:9092    the port Kafka listens on
# -v /etc/localtime:/etc/localtime    synchronizes the container clock with the host
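# Optional sanity check (a sketch; /opt/kafka/bin is the script path this image uses elsewhere in this document):
docker exec -it kafka /opt/kafka/bin/kafka-topics.sh --zookeeper 172.17.0.1:2181/kafka --list   # lists topics registered under the /kafka chroot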
# Kafka management UI (optional)
docker pull docker.io/sheepkiller/kafka-manager
docker run -it -d --name kafka-manager --rm -p 9000:9000 \
  -e ZK_HOSTS="172.17.0.1:2181" sheepkiller/kafka-manager
firewall-cmd --add-port=9000/tcp
# After it starts, open http://ds2.andunip.cn:9000 in a browser
Flume + Kafka Test
### Run a Kafka console consumer to receive messages
docker exec -it kafka bash  # enter the container
#./bin/kafka-console-consumer.sh --bootstrap-server node1:9092,node2:9092,node3:9092 --topic topicName  # can consume messages from multiple brokers
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning  # --from-beginning consumes all partitions starting from the earliest valid offset of the topic
#/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --property print.key=true --topic test --from-beginning  # prints both the key and the value of each consumed message
# Append log lines to /tmp/test_logs/app.log to test whether the data shows up in Kafka
echo hello >> /tmp/test_logs/app.log
echo "{\"id\":105,\"name\":\"ee655kkjj\"}" >> /tmp/test_logs/app.log #注意如果要求json格式,需要用\
FLINK Access to Kafka
Flink Installation and Configuration
# https://flink.apache.org/
docker pull flink:1.13.2-scala_2.12  # the latest is 1.15.0, but it brings Java version issues and is troublesome
mkdir /data/Lake/flink /data/Lake/flink/job
chmod -R 777 /data/Lake/flink
cd /data/Lake/flink
vim docker-flink.yml  # enter the contents of the docker-flink.yml section below
docker-compose -f docker-flink.yml up -d
# Open http://localhost:8081 to view the job management page (here http://ds2.andunip.cn:8081)
docker exec -it flink-job /bin/bash
# /opt/flink/bin/start-cluster.sh
# Test a simple example: run word_count
python /opt/flink/examples/python/table/batch/word_count.py
cat /tmp/result/1
# Download the required .jar files for JSON and Kafka support
# https://flink-packages.org/categories/connectors lists the connectors supported by Flink
cd /opt/flink/lib  # the jar files must be placed in this directory; mind the Kafka and Flink versions (here Kafka is 2.11, Flink is 1.13.2)
wget https://maven.aliyun.com/nexus/content/groups/public/org/apache/flink/flink-sql-connector-kafka_2.11/1.13.2/flink-sql-connector-kafka_2.11-1.13.2.jar
wget https://maven.aliyun.com/nexus/content/groups/public/org/apache/flink/flink-json/1.10.1/flink-json-1.10.1-sql-jar.jar
wget https://maven.aliyun.com/repository/public/org/apache/flink/flink-connector-redis_2.11/1.1.5/flink-connector-redis_2.11-1.1.5.jar
# Flink needs to be restarted once before the Kafka source can be used
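# e.g. from the host, using the container names defined in docker-flink.yml below: docker restart flink-job flink-task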
# Install Python and PyFlink; it is best to switch apt to a faster mirror first
apt-get update
apt-get install -y python3
apt-get install -y python3-pip
echo alias python=python3 >> ~/.bashrc
source ~/.bashrc
python -m pip install --upgrade pip -i https://pypi.douban.com/simple
python -m pip install apache-flink==1.13.2 -i https://pypi.douban.com/simple
# must match the Flink version in the Docker image; the oldest commonly used version is 1.7 and the latest is 1.15.0, but much of the available material has not caught up with it
# PyFlink has two main APIs:
# The PyFlink DataStream API provides lower-level control over Flink's core building blocks and over state and time semantics; it can be used to build more complex streaming programs.
# The PyFlink Table API & SQL lets you write powerful relational queries, much like using SQL or working with tabular data in Python.
# https://nightlies.apache.org/flink/flink-docs-master/docs/dev/python/overview/
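As a rough illustration of the DataStream side (a hypothetical standalone snippet, not part of the Kafka job below; it assumes apache-flink==1.13.2 installed as above), a small bounded stream is transformed and printed:
# minimal PyFlink DataStream sketch
from pyflink.datastream import StreamExecutionEnvironment
env = StreamExecutionEnvironment.get_execution_environment()
ds = env.from_collection(["hello", "flink", "kafka"])   # tiny bounded stream built from a Python list
ds.map(lambda s: s.upper()).print()                     # transform each element and print it
env.execute("datastream_demo")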
docker-flink.yml
version: "2.1"
services:
jobmanager:
image: flink:1.13.2-scala_2.12
hostname: flink-jobmanager
container_name: flink-job
volumes:
# 自行修改数据卷的映射位置
- /data/Lake/flink/job:/opt/flink/job
expose:
- "6123"
ports:
- "8081:8081"
command: jobmanager
environment:
- JOB_MANAGER_RPC_ADDRESS=jobmanager
taskmanager:
image: flink:1.13.2-scala_2.12
hostname: flink-taskmanager
container_name: flink-task
expose:
- "6121"
- "6122"
depends_on:
- jobmanager
command: taskmanager
links:
- "jobmanager:jobmanager"
environment:
- JOB_MANAGER_RPC_ADDRESS=jobmanager
Accessing Kafka from Flink
A simple example: using PyFlink to receive and process Kafka data in real time (job-kafka2print.py)
cat>job-kafka2print.py<<EOF
#!/usr/bin/python
# PyFlink consumes Kafka data in real time and writes it to a print sink
# -*- coding: UTF-8 -*-
from pyflink.table import EnvironmentSettings, TableEnvironment
# 1. Create the TableEnvironment
env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
table_env = TableEnvironment.create(env_settings)
# 2. Create the source table
table_env.execute_sql("""
CREATE TABLE datagen (
id INT,
name VARCHAR
) WITH (
'connector' = 'kafka',
'topic' = 'test',
'properties.bootstrap.servers' = '172.17.0.1:9092',
'properties.group.id' = 'test_Print',
'scan.startup.mode' = 'latest-offset',
'format' = 'json'
)
""")
# 3. Create the sink table
table_env.execute_sql("""
CREATE TABLE print (
id INT,
name VARCHAR
) WITH (
'connector' = 'print'
)
""")
# 4. Query the source table and perform the computation
# Create a table via the Table API:
source_table = table_env.from_path("datagen")
# Or create a table via a SQL query:
#source_table = table_env.sql_query("SELECT * FROM datagen")
result_table = source_table.select(source_table.id, source_table.name)
print("result tabel:",type(result_table))
#print("r data: ",source_table.name)
# 5. Write the computed results to the sink table
# Write the Table API result table to the sink table:
result_table.execute_insert("print").wait()
# Or write to the sink table via a SQL statement:
#table_env.execute_sql("INSERT INTO print SELECT * FROM datagen").wait()
EOF
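To run the job, one option (a sketch, assuming the file was created inside the flink-job container with PyFlink installed as above) is to execute it with Python directly, or submit it to the running cluster with the flink CLI:
python job-kafka2print.py
# or: /opt/flink/bin/flink run -py job-kafka2print.py
Messages appended to /tmp/test_logs/app.log in JSON form (as in the echo test earlier) should then appear in the print sink's output, i.e. the TaskManager stdout.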