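At work I recently needed to scrape and parse search results from Lizhi (荔枝网). Along the way I ran into a few interesting (WTF) problems, and I hope this write-up helps anyone who hits the same pitfalls. It covers two things: the format of the request formdata, and how to obtain the signed header parameter x-itouchtv-ca-signature. The complete code is at the end.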
Captured API URL: https://gdtv-api.gdtv.cn/api/search/v1/news
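1. The formdata problem in the POST request
Copying the URL, headers, and formdata into Postman works fine, but sending the same request from Python code in PyCharm kept returning {"errorCode":401,"errorMessage":"Unauthorized access!"}. I changed the formdata over and over and nothing worked. In the end the problem came down to two details.
With 'content-type': 'application/json', the request data is normally serialized to a JSON string with json.dumps():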
{"keyword": "青岛", "pageNum": 1, "type": -1, "pageSize": 20}
Here is the format that finally worked:
b'{"keyword":"\xe9\x9d\x92\xe5\xb2\x9b","pageNum":1,"type":-1,"pageSize":20}'
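Pitfall 1: the spaces in the JSON string must be removed.
Pitfall 2: the string must be encoded to bytes with encode('utf8').
After years of writing scrapers, I have never seen anything like this!! Lizhi backend team, step forward and let me say thank you..
The request code and the signing logic are given together below.
2. Decoding the x-itouchtv-ca-signature header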
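The two pitfalls above can be reproduced with the standard library alone. The sketch below is only an illustration of the serialization step, not the site's own logic: it compares the default json.dumps() output with a compact form and shows that the compact form, once UTF-8 encoded, matches the working bytes shown earlier. The variable names are my own.

```python
import json

payload = {"keyword": "青岛", "pageNum": 1, "type": -1, "pageSize": 20}

# Default json.dumps(): a space after every ',' and ':',
# and non-ASCII characters escaped as \uXXXX.
default_body = json.dumps(payload)

# Compact form: no spaces between tokens, non-ASCII characters kept
# literally, then encoded to UTF-8 bytes for the request body.
compact_body = json.dumps(
    payload, separators=(",", ":"), ensure_ascii=False
).encode("utf-8")

print(default_body)
print(compact_body)
```

Building the body this way avoids hand-writing the JSON string, which makes a stray space less likely when the keyword or paging parameters change.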
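One thing worth pointing out: after reverse-engineering the JS, I found that the signature is built from an MD5 digest, an HMAC, and a final Base64 encoding (these are hashing and encoding steps rather than encryption in the strict sense). For the MD5 step I call the JS code, because the MD5 I computed in Python did not match the JS result. The likely reason is that the JS Base64-encodes the raw digest bytes, whereas hexdigest() in Python returns a hex string. Without further ado, here is the code: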
"""
Search Lizhi (荔枝网) by keyword.
"""
import time
import hashlib
import requests
import json
import hmac
import base64
import execjs


def get_md5(url):
    # Hex MD5 helper; not used by the signing flow below.
    if isinstance(url, str):
        url = url.encode("utf-8")
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()


def get_sign(data):
    # Compile demo.js and call its get_signature(), which returns the
    # Base64-encoded MD5 digest of the request body.
    with open('demo.js', encoding='utf-8') as f:
        ctx = execjs.compile(f.read())
    # ctx.call passes the argument safely, so quotes in the body cannot
    # break the generated JS expression.
    return ctx.call('get_signature', data)


def get_signature(data):
    method = 'POST'
    target_url = 'https://gdtv-api.gdtv.cn/api/search/v1/news'
    ts_ms = int(time.time() * 1000)
    # ts_ms = 1656665381266
    data_sign = get_sign(data)
    # String to sign: method, URL, millisecond timestamp, and body digest,
    # joined by newlines.
    message_text = '\n'.join([method, target_url, str(ts_ms), data_sign])
    # message_text = 'POST\nhttps://gdtv-api.gdtv.cn/api/search/v1/news\n1656398423671\nazD1QFB7kwa3tQtzyp92fQ=='
    CONST_KEY = 'dfkcY1c3sfuw0Cii9DWjOUO3iQy2hqlDxyvDXd1oVMxwYAJSgeB6phO8eW1dfuwX'
    hmac_obj = hmac.new(key=CONST_KEY.encode(), msg=message_text.encode(), digestmod=hashlib.sha256)
    # Base64-encode the raw HMAC-SHA256 digest.
    signature = base64.b64encode(hmac_obj.digest()).decode()
    return ts_ms, signature


def run():
    # Compact JSON: no spaces between tokens (pitfall 1).
    formdata = '{"keyword":"青岛","pageNum":1,"type":-1,"pageSize":20}'
    timestamp, signature = get_signature(formdata)
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
        'x-itouchtv-ca-key': '89541443007807288657755311869534',
        'x-itouchtv-ca-signature': signature,
        'x-itouchtv-ca-timestamp': str(timestamp),
        'x-itouchtv-client': 'WEB_PC',
        'x-itouchtv-device-id': 'WEB_3ee50450-f38c-11ec-914c-f7b7f9f4989b',
        'content-type': 'application/json',
    }
    # Send the body as UTF-8 bytes (pitfall 2).
    resp = requests.post('https://gdtv-api.gdtv.cn/api/search/v1/news', data=formdata.encode('utf-8'), headers=headers).json()
    for index, i in enumerate(resp['newsItems']['list']):
        print(index, i['title'])


if __name__ == '__main__':
    run()
demo.js
const md5 = require("crypto-js/md5");
const hmac_sha256 = require("crypto-js/hmac-sha256");
const base64 = require("crypto-js/enc-base64");
const CryptoJS = require("crypto-js");
// Return the Base64-encoded MD5 digest of the request body.
function get_signature(payload) {
    return CryptoJS.enc.Base64.stringify(md5(payload));
}
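Since demo.js only Base64-encodes an MD5 digest, the same value can probably be computed in Python without execjs. The following is a minimal sketch, not verified against the live API. It assumes CryptoJS hashes the UTF-8 bytes of a string input (its default handling for strings). The names body_digest and sign, and the key 'demo-key', are hypothetical and only for illustration.

```python
import base64
import hashlib
import hmac


def body_digest(body: str) -> str:
    """Base64 of the raw MD5 digest of the UTF-8 body (not the hex string)."""
    raw = hashlib.md5(body.encode("utf-8")).digest()
    return base64.b64encode(raw).decode("ascii")


def sign(method: str, url: str, ts_ms: int, body: str, key: str) -> str:
    """HMAC-SHA256 over 'method\\nurl\\ntimestamp\\ndigest', Base64-encoded."""
    message = "\n".join([method, url, str(ts_ms), body_digest(body)])
    mac = hmac.new(key.encode("utf-8"), message.encode("utf-8"), hashlib.sha256)
    return base64.b64encode(mac.digest()).decode("ascii")


if __name__ == "__main__":
    body = '{"keyword":"青岛","pageNum":1,"type":-1,"pageSize":20}'
    # 'demo-key' is a placeholder, not the real signing key.
    print(body_digest(body))
    print(sign("POST", "https://gdtv-api.gdtv.cn/api/search/v1/news",
               1656665381266, body, "demo-key"))
```

The key difference from get_md5() in the main script is the use of digest() instead of hexdigest(): a 16-byte digest Base64-encodes to a 24-character string, the same shape as the sample value in the commented-out message_text.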