File Streams
Create two log files, log1.txt and log2.txt, in the spark/mycode/streaming/logfile directory and type some arbitrary content into them. For example, enter the following in log1.txt:
I love Hadoop
I love Spark
Spark is fast
Next, create the file stream in spark-shell. Open another terminal window, launch spark-shell, and enter the following:
import org.apache.spark.streaming._
val ssc = new StreamingContext(sc, Seconds(20))
val lines = ssc.textFileStream("file:///usr/local/spark/mycode/streaming/logfile")
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x=>(x, 1)).reduceByKey(_+_)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
In the program executed in spark-shell above, once you enter ssc.start(), the program automatically enters a listening loop and a stream of log messages scrolls across the screen.
Now create another file, log3.txt, under "/usr/local/spark/mycode/streaming/logfile", and the word-count results will be displayed in the listening window.
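For example, you could create log3.txt from another terminal with a single command (the sentence used here is arbitrary; any text will do):
$ echo "I love Spark Streaming" > /usr/local/spark/mycode/streaming/logfile/log3.txt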
Next, we implement the same directory-monitoring functionality as a standalone application.
$ cd /usr/local/spark/mycode
$ mkdir streaming
$ cd streaming
$ mkdir -p src/main/scala
$ cd src/main/scala
$ vim TestStreaming.scala
Use the vim editor to create a code file named TestStreaming.scala with the following contents:
import org.apache.spark._
import org.apache.spark.streaming._
object WordCountStreaming {
def main(args: Array[String])
{
//run in local mode with 2 threads: one for receiving data, the other for processing it
val sparkConf = new SparkConf().setAppName("WordCountStreaming").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(2)) //batch interval of 2 seconds
//a local directory is used here; an HDFS directory would also work
val lines = ssc.textFileStream("file:///usr/local/spark/mycode/streaming/logfile")
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x=>(x, 1)).reduceByKey(_+_)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
Then, in the streaming directory, run:
$ vim simple.sbt
Enter the following in the simple.sbt file:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.3.0"
Run the sbt command to compile and package the program:
$ /usr/local/sbt/sbt package
Once packaging succeeds, you can launch the program with the following commands:
$ cd /usr/local/spark/mycode/streaming
$ /usr/local/spark/bin/spark-submit --class "WordCountStreaming" /usr/local/spark/mycode/streaming/target/scala-2.11/simple-project_2.11-1.0.jar
After the command above is executed, the program enters listening mode (we call the window running this listening program the "listening window").
Switch to another shell window, create a new file log5.txt under "/usr/local/spark/mycode/streaming/logfile", type a few arbitrary words into it, then save the file and exit vim.
Switch back to the "listening window", wait about 20 seconds, then press Ctrl+C or Ctrl+D to stop the listening program; the word-count statistics will be printed on the screen of the listening window.
Socket Streams
Spark Streaming can listen on a socket port, receive data from it, and then process it accordingly.
$ cd /usr/local/spark/mycode
$ mkdir streaming
$ cd streaming
$ mkdir -p src/main/scala
$ cd src/main/scala
$ vim NetworkWordCount.scala
The contents of NetworkWordCount.scala are as follows:
package org.apache.spark.examples.streaming
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.storage.StorageLevel
object NetworkWordCount {
def main(args:Array[String]) {
if(args.length < 2)
{
System.err.println("Usage: NetworkWordCount <hostname> <port>")
System.exit(1)
}
StreamingExamples.setStreamingLogLevels()
val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x=>(x, 1)).reduceByKey(_+_)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
In the same directory, create another file, StreamingExamples.scala, with the following contents:
package org.apache.spark.examples.streaming
import org.apache.spark.internal.Logging
import org.apache.log4j.{Level, Logger}
/** Utility functions for Spark Streaming examples. */
object StreamingExamples extends Logging {
/** Set reasonable logging levels for streaming if the user has not configured log4j. */
def setStreamingLogLevels() {
val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
if(!log4jInitialized){
//We first log something to initialize Spark's default logging, then we override the logging level
logInfo("Setting log level to [WARN] for streaming example."+" To override add a custom log4j.properties to the classpath.")
Logger.getRootLogger.setLevel(Level.WARN)
}
}
}
Then, in the streaming directory, run:
$ vim simple.sbt
Enter the following in the simple.sbt file:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.3.0"
Run the sbt command to compile and package the program:
$ /usr/local/sbt/sbt package
Once packaging succeeds, you can launch the program with the following commands:
$ cd /usr/local/spark/mycode/streaming
$ /usr/local/spark/bin/spark-submit --class "org.apache.spark.examples.streaming.NetworkWordCount" /usr/local/spark/mycode/streaming/target/scala-2.11/simple-project_2.11-1.0.jar localhost 9999
Open a new window to serve as the nc window and start the nc program:
$ nc -lk 9999
You can type arbitrary words into the nc window; the listening window automatically receives the word data stream and prints word-count statistics every second.
Next, we change how the data source is generated: instead of using the nc program, we write our own program to produce the Socket data source.
$ cd /usr/local/spark/mycode/streaming/src/main/scala
$ vim DataSourceSocket.scala
package org.apache.spark.examples.streaming
import java.io.PrintWriter
import java.net.ServerSocket
import scala.io.Source
object DataSourceSocket {
def index(length: Int) = {
val rdm = new java.util.Random
rdm.nextInt(length)
}
def main(args:Array[String]){
if(args.length != 3){
System.err.println("Usage: <filename> <port> <milliseconds>")
System.exit(1)
}
val fileName = args(0)
val lines = Source.fromFile(fileName).getLines.toList
val rowCount = lines.length
val listener = new ServerSocket(args(1).toInt)
while(true){
val socket = listener.accept()
new Thread(){
override def run = {
println("Got client connected from:"+socket.getInetAddress)
val out = new PrintWriter(socket.getOutputStream(), true)
while(true){
Thread.sleep(args(2).toLong)
val content = lines(index(rowCount))
println(content)
out.write(content+"\n")
out.flush()
}
socket.close()
}
}.start()
}
}
}
Compile and package with sbt:
$ cd /usr/local/spark/mycode/streaming
$ /usr/local/sbt/sbt package
The DataSourceSocket program takes a text file as an input parameter, so before starting it, create a text file word.txt and type a few lines of arbitrary content into it.
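For example, word.txt could be created like this (the sample lines are arbitrary):
$ printf "I love Spark\nI love Hadoop\nSpark is fast\n" > /usr/local/spark/mycode/streaming/word.txt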
Start the DataSourceSocket program:
$ cd /usr/local/spark/mycode/streaming
$ /usr/local/spark/bin/spark-submit --class "org.apache.spark.examples.streaming.DataSourceSocket" /usr/local/spark/mycode/streaming/target/scala-2.11/simple-project_2.11-1.0.jar /usr/local/spark/mycode/streaming/word.txt 9999 1000
This window keeps printing randomly selected lines of text; these lines form the Socket data source and will be captured by the listening program.
Start the listening program in another window:
$ /usr/local/spark/bin/spark-submit --class "org.apache.spark.examples.streaming.NetworkWordCount" /usr/local/spark/mycode/streaming/target/scala-2.11/simple-project_2.11-1.0.jar localhost 9999
Once it starts successfully, you will see word-count statistics printed on the screen continuously.
RDD Queue Streams
When debugging a Spark Streaming application, we can use streamingContext.queueStream(queueOfRDDs) to create a DStream based on a queue of RDDs.
Below we create a new file TestRDDQueueStream.scala. The program should create a new RDD every second, while Spark Streaming processes the data every 2 seconds.
package org.apache.spark.examples.streaming
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
object QueueStream {
def main(args:Array[String])
{
val sparkConf = new SparkConf().setAppName("TestRDDQueue").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val rddQueue = new scala.collection.mutable.SynchronizedQueue[RDD[Int]]()
val queueStream = ssc.queueStream(rddQueue)
val mappedStream = queueStream.map(r => (r % 10, 1))
val reducedStream = mappedStream.reduceByKey(_+_)
reducedStream.print()
ssc.start()
for(i <- 1 to 10)
{
rddQueue += ssc.sparkContext.makeRDD(1 to 100, 2)
Thread.sleep(1000)
}
ssc.stop()
}
}
After sbt packaging succeeds, run the program with the following commands:
$ cd /usr/local/spark/mycode/streaming
$ /usr/local/spark/bin/spark-submit --class "org.apache.spark.examples.streaming.QueueStream" /usr/local/spark/mycode/streaming/target/scala-2.11/simple-project_2.11-1.0.jar
After the command above is executed, the program starts running.
Custom Receivers
Spark Streaming can receive streaming data of any type, not only from the built-in sources such as Flume, Kafka, Kinesis, files, and sockets. To support another kind of data source, the developer needs to write a custom receiver for it.
To define a custom receiver class, you normally extend an existing base class; here that is Receiver. This abstract base class has two methods that must be overridden:
- onStart(): called when the receiver starts; inside it you need to start a thread that receives the data.
- onStop(): called when the receiver stops; inside it you need to make sure that receiving the data has stopped.
Of course, receiving may also be terminated while the data stream is still being received; in that case, the thread started in onStart() can call isStopped() to decide whether it should stop receiving data.
Below is a custom receiver that receives a text stream over a socket. It splits the text stream into lines at the '\n' delimiter and stores them in Spark. If the receiving thread encounters any error while connecting or receiving, the receiver is restarted.
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
class MyReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2){
override def onStart() {
// Start the thread that receives data over a connection
new Thread("Socket Receiver") {
override def run() { receive() }
}.start()
}
override def onStop() {
// There is nothing much to do as the thread calling receive()
// is designed to stop by itself if isStopped() returns false
}
/** Create a socket connection and receive data until receiver is stopped */
private def receive() {
var socket: Socket = null
var userInput: String = null
try {
// Connect to host:port
socket = new Socket(host, port)
// Until stopped or connection broken continue reading
val reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))
userInput = reader.readLine()
while(!isStopped && userInput != null) {
store(userInput)
userInput = reader.readLine()
}
reader.close()
socket.close()
// Restart in an attempt to connect again when server is active again
restart("Trying to connect again")
} catch {
case e: java.net.ConnectException =>
// restart if could not connect to server
restart("Error connecting to " + host + ":" + port, e)
case t: Throwable =>
// restart if there is any other error
restart("Error receiving data", t)
}
}
}
Use the custom receiver as follows:
val stream = ssc.receiverStream(new MyReceiver("218.193.154.155",9999))
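A minimal sketch of how such a receiver might be used end to end in a word-count application; the object name CustomReceiverWordCount, the host localhost, and the port 9999 are placeholders, and the MyReceiver class above is assumed to be compiled into the same jar:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object CustomReceiverWordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("CustomReceiverWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    //the stream returned by receiverStream is an ordinary DStream and supports the usual transformations
    val lines = ssc.receiverStream(new MyReceiver("localhost", 9999))
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}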
See the official documentation: http://spark.apache.org/docs/latest/streaming-custom-receivers.html