Beautiful Soup

Latest recommended article published on 2024-05-01 09:48:59

云飞扬°

License: CC 4.0 BY-SA

Category: Python web scraping. Tags: Beautiful Soup

Article link: https://blog.csdn.net/weixin_44706512/article/details/99407951

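This article describes how to use the Python library BeautifulSoup to extract information from web pages. It covers installation, the commonly used BeautifulSoup, Tag, NavigableString, and Comment objects, and examples of the find_all and find functions. It also explains child, parent, and sibling nodes in detail.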

I. Introduction and Installation

Beautiful Soup is a Python library for extracting information from web pages.

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Installation

pip install beautifulsoup4

Import

from bs4 import BeautifulSoup

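II. Four Common Objects

1-BeautifulSoup object

BeautifulSoup(text)----represents the entire content of a document

prettify()---returns the document structure formatted with standard indentation

get_text()---strips all tags from the HTML document and returns a string containing only the text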

from bs4 import BeautifulSoup

text = '''
<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>
'''

# Create a BeautifulSoup object
soup = BeautifulSoup(text, 'lxml')
print(soup)

# Output with standard indentation
r = soup.prettify()
print(r)

# Get all of the text content
r = soup.get_text()
print(r)
Output:
Harry Potter
29.99


Learning XML
39.95
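
By default get_text() joins the text fragments with nothing in between, which can be awkward when neighbouring tags have no whitespace between them. A minimal sketch of the separator and strip arguments follows. The one-line sample document and the use of the built-in html.parser (instead of lxml) are assumptions made to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line sample in the spirit of the bookstore document above.
text = "<bookstore><book><title>Harry Potter</title><price>29.99</price></book></bookstore>"

# html.parser ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(text, "html.parser")

# Without arguments, the text of adjacent tags is concatenated directly.
raw = soup.get_text()

# separator is placed between fragments; strip=True trims each fragment.
clean = soup.get_text(separator=" | ", strip=True)

print(raw)    # Harry Potter29.99
print(clean)  # Harry Potter | 29.99
```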

2-Tag

A Tag corresponds to a tag in the HTML document and has two main attributes, name and attrs.

from bs4 import BeautifulSoup

text = '''
<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>
'''

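# Create the BeautifulSoup object, then take the first <title> Tag
soup = BeautifulSoup(text, 'lxml')
tag = soup.title
print(tag)
Output:
<title lang="eng">Harry Potter</title>

# The name attribute
print(tag.name)
Output:
title

# Get all attributes of the tag
print(tag.attrs)
Output:
{'lang': 'eng'}

# Get a single attribute
r = soup.title['lang']
print(r)
Output: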
eng
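
Indexing a tag with tag['lang'] raises KeyError when the attribute is absent. The sketch below contrasts it with tag.get() and tag.has_attr(), which are safer when an attribute may be missing. The id attribute and the "n/a" default are hypothetical, chosen only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<title lang="eng">Harry Potter</title>', "html.parser")
tag = soup.title

lang = tag.get("lang")            # existing attribute
missing = tag.get("id")           # absent attribute gives None
fallback = tag.get("id", "n/a")   # absent attribute with a default value
has_lang = tag.has_attr("lang")   # explicit presence check

# Dictionary-style access raises KeyError for an absent attribute.
try:
    tag["id"]
    raised = False
except KeyError:
    raised = True

print(lang, missing, fallback, has_lang, raised)
```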

3-NavigableString

A NavigableString holds the text inside a tag.

# NavigableString
print(tag.string)  # get the text inside the tag
Output:
Harry Potter
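
One detail worth knowing: .string only returns text when the tag has exactly one child. For a tag with several children it returns None, and .stripped_strings can be used to collect the text pieces instead. A small sketch with an assumed one-line sample:

```python
from bs4 import BeautifulSoup

text = "<book><title>Harry Potter</title><price>29.99</price></book>"
soup = BeautifulSoup(text, "html.parser")

# <title> has a single text child, so .string returns it.
title_text = soup.title.string

# <book> has two child tags, so .string is None.
book_text = soup.book.string

# stripped_strings yields every text fragment with whitespace trimmed.
pieces = list(soup.book.stripped_strings)

print(title_text)  # Harry Potter
print(book_text)   # None
print(pieces)      # ['Harry Potter', '29.99']
```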

4-Comment

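A Comment object is a special type of NavigableString object.

Its text does not include the comment delimiters.

# Comment object
string = '<p><!--This is a comment!--></p>'
sp = BeautifulSoup(string, 'lxml')
print(sp.p)  # get the p tag
Output:
<p><!--This is a comment!--></p>

# Get the text inside the tag
print(sp.p.string)
Output:
This is a comment!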


print(type(sp))           # <class 'bs4.BeautifulSoup'>

print(type(sp.p))         # <class 'bs4.element.Tag'>

print(type(sp.p.string))  # <class 'bs4.element.Comment'>
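
Because a Comment is a subclass of NavigableString, .string alone does not tell you whether you got real text or a comment. The following is a minimal sketch of separating the two with isinstance; the sample paragraph is an assumption for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<p>visible<!--hidden note--></p>", "html.parser")

# Children of <p>: one ordinary string and one comment.
comments = [str(n) for n in soup.p.children if isinstance(n, Comment)]

# Ordinary text: NavigableString instances that are not Comment instances.
texts = [
    str(n)
    for n in soup.p.children
    if isinstance(n, NavigableString) and not isinstance(n, Comment)
]

print(comments)  # ['hidden note']
print(texts)     # ['visible']
```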

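III. Two Common Functions

1-find_all()--returns a list of all matching elements (an empty list if nothing matches)

2-find()--returns the first matching element directly (None if nothing matches)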

from bs4 import BeautifulSoup

text = '''
<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>
'''

# Create a BeautifulSoup object
soup = BeautifulSoup(text, 'lxml')

# The two common functions
r1 = soup.find_all('title')
print(r1)
Output:
[<title lang="eng">Harry Potter</title>, <title lang="eng">Learning XML</title>]

r2 = soup.find('title')
print(r2)
Output:
<title lang="eng">Harry Potter</title>

find_all()

r3 = soup.find_all('title', lang='eng')   # find title tags whose lang attribute is 'eng'
print(r3)
Output:
[<title lang="eng">Harry Potter</title>, <title lang="eng">Learning XML</title>]


# Get the first of the matching results
r4 = soup.find_all('title', lang='eng')[0]
print(r4)
Output:
<title lang="eng">Harry Potter</title>

# Get the text of the first matching result
r5 = soup.find_all('title', lang='eng')[0].get_text()
print(r5)
Output:
Harry Potter

# Pass a list to match any of several tag names
r6 = soup.find_all(['title', 'price'])
print(r6)
Output:
[<title lang="eng">Harry Potter</title>, <price>29.99</price>, 
 <title lang="eng">Learning XML</title>, <price>39.95</price>]
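
The examples above always find something. In real scraping code the no-match case matters just as much, so here is a hedged sketch of how the two functions behave when nothing matches, together with the limit and attrs arguments. The author tag name is hypothetical and deliberately absent from the sample.

```python
from bs4 import BeautifulSoup

text = """
<bookstore>
<book><title lang="eng">Harry Potter</title><price>29.99</price></book>
<book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>
"""
soup = BeautifulSoup(text, "html.parser")

# No <author> tag exists: find_all gives an empty list, find gives None.
missing_all = soup.find_all("author")
missing_one = soup.find("author")

# limit stops the search after the given number of matches.
first_only = soup.find_all("title", limit=1)

# attrs takes a dict, useful when an attribute name clashes with a keyword.
by_attrs = soup.find_all("title", attrs={"lang": "eng"})
titles = [t.get_text() for t in by_attrs]

print(missing_all, missing_one)
print(len(first_only), titles)
```

Checking the result of find() against None before chaining further calls avoids the common AttributeError on 'NoneType'.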

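IV. Three Common Node Types

1-Child nodes: a Tag may contain several strings or other Tags; all of these are its child nodes

2-Parent node: every tag or string has a parent node, the tag that contains it

3-Sibling nodes: nodes at the same level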

from bs4 import BeautifulSoup

text = '''
<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>
'''

# Create a BeautifulSoup object
soup = BeautifulSoup(text, 'lxml')

# ***********Child nodes***********

# Get the first book tag
r6 = soup.book
print(r6)
Output:
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>

# Get all child nodes of the book tag
r7 = soup.book.contents
print(r7)
Output:
['\n', <title lang="eng">Harry Potter</title>, '\n', <price>29.99</price>, '\n']

# Get the 4th child node of the book tag (index 3)
r8 = soup.book.contents[3]
print(r8)
Output:
<price>29.99</price>
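
As the .contents output shows, whitespace between tags counts as child nodes too, so positional indexes like contents[3] are fragile. A minimal sketch of two ways to get only the child tags follows; the sample document is an assumption.

```python
from bs4 import BeautifulSoup, Tag

text = """<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>"""
soup = BeautifulSoup(text, "html.parser")

# Option 1: iterate .children and keep only Tag instances.
child_tags = [c for c in soup.book.children if isinstance(c, Tag)]
names = [c.name for c in child_tags]

# Option 2: find_all(True, recursive=False) matches every direct child tag.
direct = soup.book.find_all(True, recursive=False)
direct_names = [t.name for t in direct]

print(names)         # ['title', 'price']
print(direct_names)  # ['title', 'price']
```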

# ***********Parent nodes**************
# .parent gets the direct parent node
r = soup.title.parent
print(r)
Output:
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>

r = soup.book.parent
print(r)
Output:
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>

# .parents iterates over all ancestors of the node
for parent in soup.title.parents:
    print(parent.name)
Output:
book
bookstore
body
html
[document]
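
When only one particular ancestor is needed, find_parent() can replace a manual loop over .parents. The sketch below assumes the built-in html.parser, which does not add the html and body wrappers that lxml inserts, so the ancestor chain is shorter than in the output above.

```python
from bs4 import BeautifulSoup

text = "<bookstore><book><title>Harry Potter</title></book></bookstore>"
soup = BeautifulSoup(text, "html.parser")
title = soup.title

# Full ancestor chain; with html.parser there is no html/body wrapper.
chain = [p.name for p in title.parents]

# find_parent returns the nearest ancestor with the given name, or None.
store = title.find_parent("bookstore")
nothing = title.find_parent("table")

print(chain)       # ['book', 'bookstore', '[document]']
print(store.name)  # bookstore
print(nothing)     # None
```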

# ***********Sibling nodes***********
# .next_sibling gets the next sibling of the node
r = soup.book.next_sibling
print(r)
Output:
'\n'

# .previous_sibling gets the previous sibling of the node
r = soup.book.previous_sibling
print(r)
Output:
'\n'

# .next_siblings iterates over all siblings that follow the node
for sibling in soup.title.next_siblings:
    print(sibling)

Output:
'\n','<price>29.99</price>','\n'
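
Since .next_sibling and .previous_sibling often land on a whitespace string rather than a tag, find_next_sibling() and find_previous_sibling() are usually more convenient. A minimal sketch under the assumption of a small two-book sample:

```python
from bs4 import BeautifulSoup

text = """<bookstore>
<book><title>Harry Potter</title></book>
<book><title>Learning XML</title></book>
</bookstore>"""
soup = BeautifulSoup(text, "html.parser")
first = soup.book

# The immediate sibling is the newline between the two <book> tags.
raw_sibling = first.next_sibling

# find_next_sibling skips strings and returns the next matching tag.
second = first.find_next_sibling("book")
second_title = second.title.string

# The first <book> has no earlier <book> sibling, so this is None.
before = first.find_previous_sibling("book")

print(repr(raw_sibling))  # '\n'
print(second_title)       # Learning XML
print(before)             # None
```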