A Python Web-Scraping Journey: Douban Music

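Preface

Python is a programming language that is now widely used in artificial intelligence, and that popularity means many libraries and related frameworks are developed with Python as their primary language.
Google's TensorFlow, for example, provides Python as its primary interface, although its core is written in C++.
Python is a scripting language, but because it is easy to learn it quickly became a standard tool for scientists and accumulated a large ecosystem of libraries and frameworks. Artificial intelligence involves heavy numerical computation, so Python, being simple and productive, is a natural fit.
Python has many excellent deep learning libraries, and most deep learning frameworks today support it, which makes it the obvious choice.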

Environment and tools

PyCharm 2018.3.3

Python 3.7.2

Google Chrome version 71.0.3578.98 (official build) (64-bit)

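Sample output

The three headers printed by the script read "popular musician genres", "popular music URLs", and "popular music URLs after custom joining". The names are shown as scraped: 豆瓣摄影 (Douban Photography), 流行 (pop), 轻音乐 (light music), 摇滚 (rock), 古典 (classical), 电子 (electronic), 世界音乐 (world music), 民谣 (folk), 说唱 (rap), 爵士 (jazz), 原声 (soundtrack). The first entry is a stray link that the XPath also matches, not a genre, which is why the joining step skips it.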

..........热门音乐人才分类..........
['豆瓣摄影', '流行', '轻音乐', '摇滚', '古典', '电子', '世界音乐', '民谣', '说唱', '爵士', '原声']
..........热门音乐URL..........
['https://ypy.douban.com', '/artists/genre_page/6/', '/artists/genre_page/2/', '/artists/genre_page/8/', '/artists/genre_page/1/', '/artists/genre_page/3/', '/artists/genre_page/10/', '/artists/genre_page/4/', '/artists/genre_page/7/', '/artists/genre_page/5/', '/artists/genre_page/9/']
..........热门音乐URL进行自定义拼接..........
https://music.douban.com/artists/genre_page/6/
https://music.douban.com/artists/genre_page/2/
https://music.douban.com/artists/genre_page/8/
https://music.douban.com/artists/genre_page/1/
https://music.douban.com/artists/genre_page/3/
https://music.douban.com/artists/genre_page/10/
https://music.douban.com/artists/genre_page/4/
https://music.douban.com/artists/genre_page/7/
https://music.douban.com/artists/genre_page/5/
https://music.douban.com/artists/genre_page/9/

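Overall steps

1. Send the network request with Requests

2. Parse the HTML with etree

3. Analyze the page and extract the data

Core steps explained

1. Send the network request with Requests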

from lxml import etree
import requests

url = 'https://music.douban.com/'  # page to scrape
page = requests.Session().get(url)
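The request above works as written, but a scraper usually benefits from a timeout, an explicit status check, and a browser-like User-Agent. The following is a minimal sketch under those assumptions; the helper name `fetch_html` and the User-Agent string are hypothetical and not part of the original script.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Fetch a page and return its decoded text, raising on HTTP errors."""
    # Reuse a caller-supplied session if given, otherwise create one
    session = session or requests.Session()
    # Hypothetical User-Agent: some sites reject the library's default one
    headers = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)"}
    response = session.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return response.text
```

With this helper, `page.text` in the steps that follow could be replaced by the return value of `fetch_html(url)`.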

2. Parse the HTML with etree

htmlString = etree.HTML(page.text)
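To see what `etree.HTML` returns, it can help to try it on a small string first. This sketch uses a deliberately malformed, made-up snippet (the `broken` string is an assumption for illustration) to show that lxml's HTML parser repairs the markup and hands back the root `<html>` element.

```python
from lxml import etree

# Made-up snippet with unclosed <p> and <b> tags
broken = "<div><p>Hello <b>world</div>"

# etree.HTML parses leniently and returns the root element of the tree
root = etree.HTML(broken)
print(root.tag)

# Serialize the repaired tree to inspect what the parser built
print(etree.tostring(root, encoding="unicode"))
```

Despite its name, the `htmlString` variable in the script therefore holds an element tree, not a string.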

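3. Analyze the page and extract the data

On the page, select the 流行 (Pop) link and right-click it to inspect the element.

Use XPath to match the target tags: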

musicType = htmlString.xpath("//tr//a/text()")
musicUrl = htmlString.xpath("//tr//a/@href")
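The two XPath expressions can be tried offline on a small sample. The markup below is hypothetical, loosely modeled on a table of genre links, and is not the real Douban page. The sketch also shows an alternative that reads the text and `href` from the same `<a>` element, which keeps each name paired with its link even if some anchor has no text.

```python
from lxml import etree

# Hypothetical markup for illustration only
SAMPLE_HTML = """
<table>
  <tr><td><a href="/artists/genre_page/6/">流行</a></td></tr>
  <tr><td><a href="/artists/genre_page/8/">摇滚</a></td></tr>
</table>
<p><a href="/elsewhere/">not inside a table row</a></p>
"""

root = etree.HTML(SAMPLE_HTML)

# Same expressions as above: only <a> elements under a <tr> are matched
names = root.xpath("//tr//a/text()")
hrefs = root.xpath("//tr//a/@href")

# Alternative: select the elements once, then read both fields from each
links = [(a.text, a.get("href")) for a in root.xpath("//tr//a")]
print(links)
```

Running two separate queries, as the script does, relies on both result lists having the same length and order.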

Source code

# Scrape the musician genre categories from Douban Music: https://music.douban.com
from lxml import etree
import requests

url = 'https://music.douban.com/'  # page to scrape
page = requests.Session().get(url)
htmlString = etree.HTML(page.text)
# print("page.text \n" + page.text)  # uncomment to dump the raw HTML
musicType = htmlString.xpath("//tr//a/text()")
musicUrl = htmlString.xpath("//tr//a/@href")
print("..........热门音乐人才分类..........")  # popular musician genres
print(musicType)
print("..........热门音乐URL..........")  # popular music URLs
print(musicUrl)
print("..........热门音乐URL进行自定义拼接..........")  # URLs after custom joining
# The first href is an absolute URL on the ypy subdomain; swap in "music" to
# get the site base, then append each relative genre path that follows it.
baseUrl = musicUrl[0].replace("ypy", "music")
for path in musicUrl[1:]:
    print(baseUrl + path)
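The string replacement in the last step depends on the first scraped link happening to be a sibling subdomain. A less fragile way to build the full URLs, sketched here as an alternative rather than the author's method, is `urllib.parse.urljoin` from the standard library. The `BASE_URL` constant and the short `scraped` list are assumptions that stand in for the real request URL and the `musicUrl` results.

```python
from urllib.parse import urljoin

# Assumed base: the page the links were scraped from
BASE_URL = "https://music.douban.com/"

# Stand-in for musicUrl: one absolute link followed by relative paths
scraped = [
    "https://ypy.douban.com",
    "/artists/genre_page/6/",
    "/artists/genre_page/2/",
]

# urljoin resolves relative paths against the base and leaves
# absolute URLs unchanged, so no index-based special case is needed
full_urls = [urljoin(BASE_URL, href) for href in scraped]

# Keep only the genre pages on the music site
genre_urls = [u for u in full_urls if u.startswith(BASE_URL)]
for u in genre_urls:
    print(u)
```

Filtering by prefix drops the stray absolute link without assuming it is always the first element.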