Flume + Kafka + Storm Trident + HBase Hands-On Project
Copyright notice: Reproduction without permission is prohibited and will be pursued.
Tags (space-separated): Storm project
Written by Vin
1. Project Overview
Project name: real-time website traffic statistics built on Storm
Project requirements: use Storm to analyze the website access logs produced by the business system and compute various PV (page view) statistics in real time, including:
PV per individual URL
PV from external referrer sites
PV per search keyword
Project technical architecture:
This article aims to record the key configuration points for future reference, so everything is set up in the simplest possible way: Nginx log entries are simulated by a log-generating program, and a single Flume layer monitors that log.
2. Data Simulation
2.1 Data Simulation and Environment Setup
1. Generate the log
Sample log lines:
132.46.30.61 - - [1476285399264] "GET /list.php HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Linux; Android 4.2.1; Galaxy Nexus Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19" "-"
215.168.214.201 - - [1476285965677] "GET /edit.php HTTP/1.1" 200 0 "http://www.google.cn/search?q=spark mllib" "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" "-"
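Before looking at the generator, note how the fields the project cares about map onto such a line: the request URL sits inside the quoted request, the referrer in the second-to-last quoted field, and the search keyword inside the referrer's query string. A minimal Scala parsing sketch (the regex and names here are illustrative assumptions, not code from the project itself):
import scala.util.matching.Regex

// Illustrative parser for the sample lines above; regex and names are
// assumptions for this sketch, not project code.
object NginxLogParser {
  // ip - - [millis] "GET /url HTTP/1.1" status bytes "referrer" "user_agent" "-"
  val LogPattern: Regex =
    """(\S+) - - \[(\d+)\] "GET (\S+) HTTP/1\.1" (\d+) (\d+) "([^"]*)" "([^"]*)" "-"""".r

  /** Extract (ip, timestamp, url, referrer, userAgent) from one log line. */
  def parse(line: String): Option[(String, Long, String, String, String)] =
    line match {
      case LogPattern(ip, ts, url, _, _, referrer, userAgent) =>
        Some((ip, ts.toLong, url, referrer, userAgent))
      case _ => None
    }

  /** Pull the search keyword out of a referrer such as
      http://www.google.cn/search?q=spark mllib (None if absent). */
  def searchKeyword(referrer: String): Option[String] =
    referrer.split("[?&](wd|q|query|p)=") match {
      case Array(_, kw) => Some(kw)
      case _            => None
    }
}
Each of the three required PV statistics then reduces to a keyed count over these fields: the URL itself, the referrer site, or the extracted keyword.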
Based on the samples above, a Scala program generates one log line every second. The code is as follows (file name: NginxLogGenerator.scala):
package org.project.storm.study

import scala.collection.immutable.IndexedSeq
import scala.util.Random

/**
 * Generates one fake Nginx access-log line per second to simulate traffic.
 * Created by hp-pc on 2016/10/16.
 */
object NginxLogGenerator {

  /** Candidate user-agent strings, keyed by a one-decimal bucket in [0.0, 1.0]. */
  val userAgents: Map[Double, String] = Map(
    0.0 -> "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)",
    0.1 -> "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)",
    0.2 -> "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727)",
    0.3 -> "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    0.4 -> "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    0.5 -> "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    0.6 -> "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    0.7 -> "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_3 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11B511 Safari/9537.53",
    0.8 -> "Mozilla/5.0 (Linux; Android 4.2.1; Galaxy Nexus Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19",
    0.9 -> "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    1.0 -> " "
  )

  /** Candidate octets for assembling random IP addresses. */
  val ipSliceList = List(10, 28, 29, 30, 43, 46, 53, 61, 72, 89, 96, 132, 156, 122, 167,
    143, 187, 168, 190, 201, 202, 214, 215, 222)

  /** Candidate URL paths. */
  val urlPathList = List(
    "login.php", "view.php", "list.php", "upload.php", "admin/login.php", "edit.php", "index.html"
  )

  /** Candidate http_referer templates (%s is filled with a search keyword). */
  val httpRefers = List(
    "http://www.baidu.com/s?wd=%s",
    "http://www.google.cn/search?q=%s",
    "http://www.sogou.com/web?query=%s",
    "http://www.yahoo.com/s?p=%s",
    "http://cn.bing.com/search?q=%s"
  )

  /** Candidate search keywords. */
  val searchKeywords = List(
    "spark",
    "hadoop",
    "yarn",
    "hive",
    "mapreduce",
    "spark mllib",
    "spark sql",
    "phoenix",
    "hbase"
  )

  val random = new Random()

  /** Assemble a random IPv4 address from the candidate octets. */
  def sampleIp(): String = {
    val ipEles: IndexedSeq[Int] = (1 to 4).map { _ =>
      ipSliceList(random.nextInt(ipSliceList.length))
    }
    ipEles.mkString(".")
  }

  /** Pick a random URL path. */
  def sampleUrl(): String =
    urlPathList(random.nextInt(urlPathList.length))

  /**
   * Pick a user agent: round a uniform random double to one decimal place
   * ("%#.1f") so it always hits one of the map keys 0.0 through 1.0.
   */
  def sampleUserAgent(): String = {
    val distUppon = random.nextDouble()
    userAgents("%#.1f".format(distUppon).toDouble)
  }

  /**
   * Pick a referrer: about 80% of requests carry no referrer ("-");
   * the rest get a search-engine URL filled with a random keyword.
   */
  def sampleRefer(): String = {
    val fra = random.nextDouble()
    if (fra > 0.2)
      "-"
    else {
      val referStr = httpRefers(random.nextInt(httpRefers.length))
      val queryStr = searchKeywords(random.nextInt(searchKeywords.length))
      referStr.format(queryStr)
    }
  }

  /** Assemble one complete line in Nginx combined-log format. */
  def sampleOneLog(): String = {
    val time = System.currentTimeMillis()
    "%s - - [%s] \"GET /%s HTTP/1.1\" 200 0 \"%s\" \"%s\" \"-\"".format(
      sampleIp(),
      time,
      sampleUrl(),
      sampleRefer(),
      sampleUserAgent()
    )
  }

  /** Print one generated log line per second, forever. */
  def main(args: Array[String]): Unit = {
    while (true) {
      println(sampleOneLog())
      Thread.sleep(1000)
    }
  }
}
Sample run:
2. Simulate the Nginx server
Create a new directory on the Linux machine: mkdir ~/project_workspace
Copy the NginxLogGenerator.scala file into that newly created directory.
Write a Linux shell script to compile and run the Scala file (file name: generator_log.sh). The code is as follows:
#!/bin/bash
SCALAC='/usr/bin/scalac'
$SCALAC NginxLogGenerator.scala
SCALA='/usr/bin/scala'
# The generator lives in package org.project.storm.study, so run it by its
# fully qualified name with the compiled classes on the classpath.
$SCALA -classpath /path/to/classes org.project.storm.study.NginxLogGenerator >> nginx.log
Run sh generator_log.sh and a nginx.log file will be generated in that directory.
You can watch it grow with tail -f nginx.log. To stop the generator, press CTRL + C, or look up the pid with jps and then run kill -9 pid.
The execution result:
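As an alternative to redirecting stdout in generator_log.sh, the generator could append to nginx.log itself. A minimal sketch (the object name is hypothetical, and the path is an assumption chosen to match the Flume source configured in section 2.2):
import java.io.{FileWriter, PrintWriter}
import org.project.storm.study.NginxLogGenerator

// Hypothetical variant: append each generated line to nginx.log directly
// instead of relying on shell redirection.
object NginxLogFileGenerator {
  def main(args: Array[String]): Unit = {
    // Assumed path; must match a1.sources.r1.command in Storm_project.conf.
    val logPath = "/home/vin/project_workspace/nginx.log"
    while (true) {
      // Open in append mode and close per line, so every line is flushed
      // immediately and visible to tail -F and Flume's exec source.
      val out = new PrintWriter(new FileWriter(logPath, true))
      out.println(NginxLogGenerator.sampleOneLog())
      out.close()
      Thread.sleep(1000)
    }
  }
}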
2.2 Flume Setup and Configuration
On the vin01 machine, install and configure Flume, then create a new file Storm_project.conf with the following configuration:
#exec source - memory channel - kafka sink/hdfs sink
a1.sources = r1
a1.sinks = kafka_sink hdfs_sink
a1.channels = c1 c2
a1.sources.r1.type = exec
# Note: use an absolute path here; an earlier attempt with ~/... never worked.
a1.sources.r1.command = tail -F /home/vin/project_workspace/nginx.log
# kafka_sink
a1.sinks.kafka_sink.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.kafka_sink.topic = nginxlog
a1.sinks.kafka_sink.brokerList = vin01:9092
a1.sinks.kafka_sink.requiredAcks = 1
a1.sinks.kafka_sink.batchSize = 20
a1.sinks.kafka_sink.channel = c1
# hdfs_sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = /flume/events/%y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = nginx_log-
a1.sinks.hdfs_sink.hdfs.fileType = DataStream
a1.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
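Once the agent is up, a quick way to confirm that log lines are actually reaching the nginxlog topic is a throwaway consumer. A minimal sketch using the Kafka Java client (the object name, group id, and offset settings are my own assumptions):
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

// Throwaway consumer to verify that Flume delivers lines to the nginxlog topic.
object NginxLogTopicCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "vin01:9092")   // matches the Flume brokerList
    props.put("group.id", "nginxlog-check")        // hypothetical group id
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("auto.offset.reset", "earliest")     // read from the beginning on first run
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("nginxlog"))
    while (true) {
      val records = consumer.poll(1000L)           // poll(long) as in the 0.9/0.10 clients
      records.asScala.foreach(r => println(r.value()))
    }
  }
}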