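Preface
- Python is a programming language that is widely used in artificial intelligence today, which is why so many libraries and related frameworks are built with Python as their primary language.
- Google's TensorFlow, for example, is used mainly through its Python API (its core is written largely in C++).
- Although Python is a scripting language, it is easy to learn and quickly became a tool of choice for scientists, accumulating a large collection of libraries and frameworks along the way. AI involves heavy numerical computation, so Python, being simple and productive, is a natural fit.
- Python has many excellent deep learning libraries, and most deep learning frameworks now support it, which makes it the obvious choice.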
Environment and tools
PyCharm 2018.3.3
Python 3.7.2
Google Chrome version 71.0.3578.98 (official build) (64-bit)
Results
..........热门音乐人才分类..........
['豆瓣摄影', '流行', '轻音乐', '摇滚', '古典', '电子', '世界音乐', '民谣', '说唱', '爵士', '原声']
..........热门音乐URL..........
['https://ypy.douban.com', '/artists/genre_page/6/', '/artists/genre_page/2/', '/artists/genre_page/8/', '/artists/genre_page/1/', '/artists/genre_page/3/', '/artists/genre_page/10/', '/artists/genre_page/4/', '/artists/genre_page/7/', '/artists/genre_page/5/', '/artists/genre_page/9/']
..........热门音乐URL进行自定义拼接..........
https://music.douban.com/artists/genre_page/6/
https://music.douban.com/artists/genre_page/2/
https://music.douban.com/artists/genre_page/8/
https://music.douban.com/artists/genre_page/1/
https://music.douban.com/artists/genre_page/3/
https://music.douban.com/artists/genre_page/10/
https://music.douban.com/artists/genre_page/4/
https://music.douban.com/artists/genre_page/7/
https://music.douban.com/artists/genre_page/5/
https://music.douban.com/artists/genre_page/9/
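Overall steps
1. Send the network request with Requests
2. Parse the HTML with etree
3. Analyze and extract the data
Core steps explained
1. Send the network request with Requests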
from lxml import etree
import requests
url = 'https://music.douban.com/'  # the page to scrape
page = requests.Session().get(url)
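The request above has no timeout and does not check the HTTP status. A slightly more defensive variant might look like the sketch below. The helper name `fetch_html`, the 10-second timeout, and the `User-Agent` string are all assumptions for illustration and are not part of the original script; some sites reject clients that do not send a browser-like `User-Agent`, and whether this one does is not stated here.

```python
# Illustrative defaults, not taken from the original script.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
}


def fetch_html(session, url, timeout=10):
    """Fetch url with the given requests.Session and return the body text.

    Raises an exception on 4xx/5xx responses instead of silently
    returning an error page.
    """
    response = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out):
# import requests
# html = fetch_html(requests.Session(), "https://music.douban.com/")
```

Passing the session in as a parameter keeps the helper easy to exercise without touching the network.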
2. Parse the HTML with etree
htmlString = etree.HTML(page.text)
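Despite the variable name `htmlString`, `etree.HTML` returns a parsed element tree rather than a string. The small sketch below shows this on a made-up, deliberately incomplete snippet (the snippet is hypothetical and does not come from the Douban page): the lenient HTML parser wraps it in `html` and `body` elements and closes the open tag.

```python
from lxml import etree

# Hypothetical, deliberately sloppy snippet: no <html>/<body>, unclosed <p>.
snippet = "<p>Hello <a href='/artists/genre_page/6/'>pop</a>"

root = etree.HTML(snippet)

# root is an element object (the <html> root), not a string.
print(type(root).__name__)
print(root.tag)

# Serialize the repaired tree back to text to see what the parser built.
print(etree.tostring(root, encoding="unicode"))
```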
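3. Analyze and extract the data
On the web page, select the "流行" (pop) link and right-click it to inspect the element.
Use XPath to match the target tags: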
musicType = htmlString.xpath("//tr//a/text()")
musicUrl = htmlString.xpath("//tr//a/@href")
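To see what these two XPath expressions return without hitting the network, they can be run against a small hand-written table. The markup below is an assumed simplification for illustration, not the real structure of the Douban page. The sketch also shows one way to keep each name paired with its link by selecting the `a` elements themselves.

```python
from lxml import etree

# Assumed, simplified markup; the real page is more complex.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a href="/artists/genre_page/6/">流行</a></td>
    <td><a href="/artists/genre_page/2/">轻音乐</a></td>
  </tr>
  <tr>
    <td><a href="/artists/genre_page/8/">摇滚</a></td>
  </tr>
</table>
"""

tree = etree.HTML(SAMPLE_HTML)

# text() selects the text nodes inside each matched <a>.
names = tree.xpath("//tr//a/text()")
# @href selects the href attribute of each matched <a>.
hrefs = tree.xpath("//tr//a/@href")

# Selecting the elements keeps each name and URL together.
links = [(a.text, a.get("href")) for a in tree.xpath("//tr//a")]

print(names)
print(hrefs)
print(links)
```

Working with `(name, href)` pairs avoids relying on two separate lists staying in the same order and length.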
Source code
# Scrape the artist genres from Douban Music: https://music.douban.com
from lxml import etree
import requests
url = 'https://music.douban.com/'  # the page to scrape
page = requests.Session().get(url)
htmlString = etree.HTML(page.text)  # despite the name, this is a parsed element tree
# print("page.text \n" + page.text)
musicType = htmlString.xpath("//tr//a/text()")  # link texts (genre names)
musicUrl = htmlString.xpath("//tr//a/@href")  # link targets
print("..........热门音乐人才分类..........")  # "popular artist genres"
print(musicType)
print("..........热门音乐URL..........")  # "popular music URLs"
print(musicUrl)
print("..........热门音乐URL进行自定义拼接..........")  # "popular music URLs, joined manually"
# The first link is an absolute URL on another subdomain (ypy); swap in "music"
# to get the base URL, then append each relative genre path to it.
baseUrl = musicUrl[0].replace("ypy", "music")
for urlIndex in range(1, len(musicUrl)):
    print(baseUrl + musicUrl[urlIndex])
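The script builds the full URLs by rewriting the first scraped link and concatenating strings, which depends on that first link always being present and in the same form. An alternative, sketched below, resolves each relative path against a fixed base with the standard library's `urljoin`. The function name `build_genre_urls` and the sample list are assumptions for illustration, and this swaps in a different technique from the one the original code uses.

```python
from urllib.parse import urljoin, urlsplit

BASE_URL = "https://music.douban.com/"


def build_genre_urls(hrefs, base_url=BASE_URL):
    """Resolve relative hrefs against base_url.

    Links that already carry their own host (such as the first entry in
    the scraped list) are skipped.
    """
    return [
        urljoin(base_url, href)
        for href in hrefs
        if not urlsplit(href).netloc
    ]


# Sample input shaped like the scraped list shown in the results.
sample_hrefs = [
    "https://ypy.douban.com",
    "/artists/genre_page/6/",
    "/artists/genre_page/2/",
]

for full_url in build_genre_urls(sample_hrefs):
    print(full_url)
```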
Source code location
Follow the author's official account and reply "Python" in the chat to get the source.