A few words up front
Last week I discovered the Linux subsystem on Windows 10 (WSL) and thought it was fantastic, so yesterday I tried to build a real-time streaming environment inside it (Ubuntu 16.04). Installing the individual components went smoothly, but the joint deployment hit a serious snag: after packaging the program and submitting it to YARN, the job kept failing with errors that roughly amounted to "not enough resources". I fought with it until past 7 PM with no luck, then switched to a real Ubuntu 16.04 machine and redid everything there. Conclusion: the Windows subsystem is positioned as a development environment, so it is best not to use it for deployment and debugging.
Components to install
jdk1.8.0_111
scala-2.11.7
node-v6.14.1-linux-x64
apache-flume-1.7.0-bin
apache-hive-2.1.1-bin
apache-maven-3.3.3
elasticsearch-6.2.3
elasticsearch-head-master
hadoop-2.7.1
kafka_2.11-0.10.1.1
spark-2.1.1-bin-hadoop2.7
Environment variables used
# /etc/profile
export JAVA_HOME=/usr/local/jvm/jdk1.8.0_111
export JRE_HOME=/usr/local/jvm/jdk1.8.0_111/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export SCALA_HOME=/usr/local/jvm/scala-2.11.7
export NODE_HOME=/usr/local/node/node-v6.14.1-linux-x64
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$SCALA_HOME/bin:$NODE_HOME/bin:$PATH
# ~/.bashrc
export MAVEN_HOME=/home/futhead/program/apache-maven-3.3.3
export SPARK_HOME=/home/futhead/program/spark-2.1.1-bin-hadoop2.7
export HADOOP_HOME=/home/futhead/program/hadoop-2.7.1
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HIVE_HOME=/home/futhead/program/apache-hive-2.1.1-bin
export KAFKA_HOME=/home/futhead/program/kafka_2.11-0.10.1.1
export FLUME_HOME=/home/futhead/program/apache-flume-1.7.0-bin
export PATH="$MAVEN_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$HIVE_HOME/bin:$KAFKA_HOME/bin:$FLUME_HOME/bin:$PATH"
Basic architecture
flume --> kafka (depends on ZooKeeper) --> spark-streaming (submitted to YARN) --> writes to elasticsearch + mysql
Steps
Mock business data source
Create /home/futhead/log/mock-data.log as the simulated business data source; Flume will tail this file, so it just needs to keep receiving new lines.
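A throwaway generator is enough for that; a minimal sketch (run it in a spare terminal, the line content itself does not matter):
while true; do echo "hello spark streaming $(date +%s)" >> /home/futhead/log/mock-data.log; sleep 1; done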
Flume pulls the data into Kafka
Flume configuration:
# define the agent
a1.sources = src1
a1.channels = ch1
a1.sinks = k1
# define the sources
a1.sources.src1.type = exec
a1.sources.src1.command = tail -F /home/futhead/log/mock-data.log
a1.sources.src1.channels = ch1
# define the sinks
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = test
a1.sinks.k1.brokerList = futhead:9092
a1.sinks.k1.batchSize = 20
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.channel = ch1
# define the channels
a1.channels.ch1.type = memory
a1.channels.ch1.capacity = 1000
Start the agent (run from $FLUME_HOME, with the configuration above saved as conf/a1.conf):
flume-ng agent --conf conf --conf-file conf/a1.conf --name a1 -Dflume.root.logger=INFO,console
Kafka
The ZooKeeper instance bundled with Kafka is used here.
Start ZooKeeper:
zookeeper-server-start.sh config/zookeeper.properties &
Start Kafka (in another terminal):
kafka-server-start.sh config/server.properties
Create the topic:
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
List topics:
kafka-topics.sh --list --zookeeper localhost:2181
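Before wiring in Spark, it is worth confirming that lines appended to the mock log actually reach the topic; Kafka's console consumer can be used for that:
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning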
Integrating Spark Streaming with Kafka
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.futhead</groupId>
<artifactId>streaming</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<scala.version>2.11.7</scala.version>
</properties>
<dependencies>
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<!--<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.10.0.0</version>
</dependency>-->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.40</version>
</dependency>
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-spark-20_2.11</artifactId>
<version>6.2.3</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<resources>
<resource>
<directory>src/main/resource</directory>
</resource>
</resources>
<plugins>
<!-- Scala compile plugin -->
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<appendAssemblyId>false</appendAssemblyId> <!-- do not append the assembly id to the jar name -->
</configuration>
<executions>
<execution>
<id>make-assembly</id> <!-- this is used for inheritance merges -->
<phase>package</phase> <!-- bind to the packaging phase -->
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
Integration code
package com.futhead.streaming
import java.util.Properties
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.elasticsearch.spark.sql._
/**
* Created by futhead on 19-1-13.
*/
object KafkaSqlWordCount {
def main(args:Array[String]): Unit ={
val conf = new SparkConf()
// .setMaster("local[2]")
.setAppName("KafkaSqlWordCount")
val ssc = new StreamingContext(conf, Seconds(5))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group1",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val prop = new Properties()
prop.put("user", "root")
prop.put("password", "futhead")
prop.put("driver","com.mysql.jdbc.Driver")
val topics = Array("test")
val messages = KafkaUtils.createDirectStream [String,String](
ssc,
PreferConsistent,
Subscribe[String,String](topics, kafkaParams)
)
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
// val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
// wordCounts.print()
words.foreachRDD { (rdd: RDD[String], time: Time) =>
// Get the singleton instance of SparkSession
val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
import spark.implicits._
// Convert RDD[String] to RDD[case class] to DataFrame
val wordsDataFrame = rdd.map(w => Record(w)).toDF()
// Creates a temporary view using the DataFrame
wordsDataFrame.createOrReplaceTempView("words")
// Do word count on table using SQL and print it
val wordCountsDataFrame = spark.sql("select word, count(*) as total from words group by word")
println(s"========= $time =========")
wordCountsDataFrame
.write.mode("append")
.jdbc("jdbc:mysql://localhost:3306/wordcount", "wordcount.wordcount", prop)
wordCountsDataFrame.saveToEs("wordcount/wordcount", Map("es.mapping.id" -> "word"))
}
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
}
}
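The code above uses a Record case class and a SparkSessionSingleton helper that are not shown in the snippet. A minimal version of both, following the pattern from Spark's SqlNetworkWordCount example, can sit in the same file (it needs one extra import, org.apache.spark.sql.SparkSession):

// Case class used to turn RDD[String] into a DataFrame with a single "word" column.
case class Record(word: String)

// Lazily instantiated singleton SparkSession, as in the Spark streaming examples.
object SparkSessionSingleton {

  @transient private var instance: SparkSession = _

  def getInstance(sparkConf: SparkConf): SparkSession = {
    if (instance == null) {
      instance = SparkSession
        .builder
        .config(sparkConf)
        .getOrCreate()
    }
    instance
  }
}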
Compile and package
mvn clean install
This produces target/streaming-1.0-SNAPSHOT.jar (the assembly plugin bundles the dependencies into it), which is the jar submitted to YARN below.
Start Elasticsearch (from the bin directory of elasticsearch-6.2.3):
./elasticsearch -d
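A quick sanity check that the node is up (assuming the default HTTP port 9200):
curl http://localhost:9200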
Start the elasticsearch-head plugin (from the elasticsearch-head-master directory):
npm run start
Create the database
CREATE DATABASE IF NOT EXISTS wordcount default charset utf8 COLLATE utf8_general_ci;
Deploy to YARN
Start HDFS and YARN, then submit the job:
start-dfs.sh
start-yarn.sh
spark-submit --master yarn --deploy-mode cluster --class com.futhead.streaming.KafkaSqlWordCount target/streaming-1.0-SNAPSHOT.jar
Take a look at http://localhost:8088/cluster to confirm the application is running.
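The same list of running applications is also available from the command line:
yarn application -list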
Test (example commands for each step are sketched below):
Append a few words to /home/futhead/log/mock-data.log
Check the data in MySQL
Check the data in Elasticsearch
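A minimal check, reusing the MySQL credentials and the wordcount table/index names from the code above:
# feed a line through the pipeline
echo "hello world hello spark" >> /home/futhead/log/mock-data.log
# word counts written via the JDBC sink
mysql -uroot -pfuthead -e "SELECT * FROM wordcount.wordcount;"
# documents written via saveToEs
curl "http://localhost:9200/wordcount/_search?pretty"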
Alright, the basic pipeline works end to end without any major problems.