We usually collect nginx's access_log and parse each line to extract its individual fields. The parsing relies on a regular expression. Assume the nginx log_format is as follows:
$remote_addr - $remote_user [$time_local] "$request" "$upstream_addr" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$http_x_forwarded_for" "$args" "$server_name" "$http_X_Real_Scheme" "$scheme" "$request_time"
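The regular expression for extracting the fields is then as follows. It is escaped as it would appear inside a string literal (for example in Java), and the \\s* before the status code lets it match whether or not a space follows the upstream address, since the sample line below has none:
(.*?)\\s+-\\s+(.*?)\\s+\\[(.*?)\\]\\s+\"(.*?)\\s+(.*?)\\s+(.*?)\"\\s+\"(.*?)\"\\s*(\\d+)\\s+(\\d+)\\s+\"(.*?)\"\\s+\"(.*?)\"\\s+\"(.*?)\"\\s+\"(.*?)\"\\s+\"(.*?)\"\\s+\"(.*?)\"\\s+\"(.*?)\"\\s+\"(.*?)\"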
Take this log line as an example:
192.168.111.46 - - [02/Feb/2021:00:00:38 +0800] "GET /api/social/vl/message/getMsgSummary?companyId=276&nocache=1612195235735 HTTP/1.1" "192.168.104.93:8090"200 274 "https://a.b.com/index.html" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148" "223.104.65.164, 150.138.154.239, 120.27.173.10, 100.121.247.67" "companyId=276&nocache=1612195235735" "a.b.com" "-" "http" "0.017"
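As a rough illustration of how the expression above might be applied to this line, here is a minimal Java sketch. Java is an assumption based on the escaping of the expression, and the names NginxLogParser, parse, HEAD, TAIL and SAMPLE are hypothetical, introduced only for this example. The pattern is written with \\s* ahead of the status code so that it accepts the line whether or not a space follows the upstream address.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class NginxLogParser {
    // Head of the expression: address, user, time, the three parts of the
    // request, upstream address, status and body bytes (groups 1 to 9).
    // "\\s*" before the status code also accepts lines without a space there.
    static final String HEAD =
        "(.*?)\\s+-\\s+(.*?)\\s+\\[(.*?)\\]\\s+\"(.*?)\\s+(.*?)\\s+(.*?)\"\\s+\"(.*?)\"\\s*(\\d+)\\s+(\\d+)";

    // Tail of the expression: eight double-quoted fields (groups 10 to 17).
    static final String TAIL =
        "\\s+\"(.*?)\"\\s+\"(.*?)\"\\s+\"(.*?)\"\\s+\"(.*?)\""
        + "\\s+\"(.*?)\"\\s+\"(.*?)\"\\s+\"(.*?)\"\\s+\"(.*?)\"";

    static final Pattern LOG_PATTERN = Pattern.compile(HEAD + TAIL);

    // The sample log line from the text, with its double quotes escaped.
    static final String SAMPLE = "192.168.111.46 - - [02/Feb/2021:00:00:38 +0800] "
        + "\"GET /api/social/vl/message/getMsgSummary?companyId=276&nocache=1612195235735 HTTP/1.1\" "
        + "\"192.168.104.93:8090\"200 274 \"https://a.b.com/index.html\" "
        + "\"Mozilla/5.0 (iPhone; CPU iPhone OS 14_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148\" "
        + "\"223.104.65.164, 150.138.154.239, 120.27.173.10, 100.121.247.67\" "
        + "\"companyId=276&nocache=1612195235735\" \"a.b.com\" \"-\" \"http\" \"0.017\"";

    // Returns all groups (index 0 is the whole match), or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.find()) {
            return null;
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = parse(SAMPLE);
        if (groups == null) {
            System.out.println("line did not match");
            return;
        }
        // Print each capture group as "index:value", starting at group 1.
        for (int i = 1; i < groups.length; i++) {
            System.out.println(i + ":" + groups[i]);
        }
    }
}
```

Returning null for a non-matching line is one possible design choice; in a real pipeline it may be preferable to count or log such lines rather than drop them silently.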
Parsing it yields the following result:
1:192.168.111.46
2:-
3:02/Feb/2021:00:00:38 +0800
4:GET
5:/api/social/vl/message/getMsgSummary?companyId=276&nocache=1612195235735
6:HTTP/1.1
7:192.168.104.93:8090
8:200
9:274
10:https://a.b.com/index.html
11:Mozilla/5.0 (iPhone; CPU iPhone OS 14_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148
12:223.104.65.164, 150.138.154.239, 120.27.173.10, 100.121.247.67
13:companyId=276&nocache=1612195235735
14:a.b.com
15:-
16:http
17:0.017
In total, 17 capture groups are extracted, and each field can then be retrieved by its group index.
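Looking fields up by bare index is easy to get wrong, so one option is to give each index a name. The Java sketch below assumes the 17 groups have already been copied into an array in which index i holds capture group i (index 0 being the whole match). The class NginxLogFields, the method toMap and the labels method, uri and protocol for the three parts of $request are hypothetical names chosen for this example; the remaining labels follow the log_format variables.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of any library.
class NginxLogFields {
    // Field names in capture-group order: NAMES[0] labels group 1.
    static final String[] NAMES = {
        "remote_addr", "remote_user", "time_local",
        "method", "uri", "protocol",              // assumed labels for the parts of $request
        "upstream_addr", "status", "body_bytes_sent",
        "http_referer", "http_user_agent", "http_x_forwarded_for",
        "args", "server_name", "http_X_Real_Scheme", "scheme", "request_time"
    };

    // groups[i] must hold capture group i; groups[0] (the whole match) is ignored.
    static Map<String, String> toMap(String[] groups) {
        if (groups.length != NAMES.length + 1) {
            throw new IllegalArgumentException(
                "expected " + (NAMES.length + 1) + " entries, got " + groups.length);
        }
        // LinkedHashMap keeps the fields in log_format order.
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 1; i < groups.length; i++) {
            fields.put(NAMES[i - 1], groups[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Values taken from the sample log line; index 0 stands in for the whole match.
        String[] groups = {
            "(whole match)", "192.168.111.46", "-", "02/Feb/2021:00:00:38 +0800",
            "GET", "/api/social/vl/message/getMsgSummary?companyId=276&nocache=1612195235735",
            "HTTP/1.1", "192.168.104.93:8090", "200", "274", "https://a.b.com/index.html",
            "Mozilla/5.0 (iPhone; CPU iPhone OS 14_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148",
            "223.104.65.164, 150.138.154.239, 120.27.173.10, 100.121.247.67",
            "companyId=276&nocache=1612195235735", "a.b.com", "-", "http", "0.017"
        };
        Map<String, String> fields = toMap(groups);
        System.out.println("status=" + fields.get("status"));
        System.out.println("request_time=" + fields.get("request_time"));
    }
}
```

Named capture groups in the pattern itself would be an alternative, at the cost of a longer expression.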