Python Web Scraping for Security: Beautiful Soup with Examples

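Contents

Beautiful Soup:

Parsers:

Node selectors:

Nested selection:

Relational selection:

Child nodes:

Descendant nodes:

Parent node:

Ancestor nodes:

Sibling nodes:

Previous sibling:

Next sibling:

All following siblings:

All preceding siblings:

Method selectors:

CSS selectors: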


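Beautiful Soup:

Beautiful Soup (package bs4)
Parses HTML and XML documents
Parsers: the built-in html.parser, lxml's HTML parser, and lxml's XML parser
Document traversal: as with XPath, the document is organized into a tree
Searching: find() and find_all()
Modification: nodes can be added, removed, and edited
Data extraction
Handling of special characters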
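
The list above mentions modification, which the later examples do not show. Below is a minimal sketch of editing a tree; the markup and the tag names are made up for illustration, and html.parser is used only so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from the examples in this article
soup = BeautifulSoup(
    '<div><p class="a">hello</p><span>drop me</span></div>', 'html.parser'
)

p = soup.p
p.string = 'hi'            # edit: replace the text of the tag
p['class'] = 'greeting'    # edit: overwrite an attribute

soup.span.decompose()      # remove: delete the tag and its contents

new_tag = soup.new_tag('b')  # add: create a tag and attach it to the tree
new_tag.string = 'new'
soup.div.append(new_tag)

result = str(soup)
print(result)
```

Querying, the remaining operation, is covered by the selector sections that follow.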

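Parsers:

html.parser (built in): decent speed, no extra dependency
lxml: fast; requires the lxml package
xml: fast; also provided by lxml
html5lib: rarely used; slow

pip install beautifulsoup4 lxml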
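
Since lxml is a separate package, a script may want to fall back to the built-in parser when lxml is missing. The helper below is a sketch under that assumption; make_soup is a hypothetical name, not part of bs4. Beautiful Soup raises FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred='lxml'):
    """Parse markup with the preferred parser, else fall back to html.parser.

    make_soup is a hypothetical helper written for this example.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the built-in one
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<p>hello</p>')
text = soup.p.string
print(text)
```

Note that different parsers can build slightly different trees from invalid markup, so a fallback like this suits simple pages best.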

# Import the Beautiful Soup library
from bs4 import BeautifulSoup

# Create a BeautifulSoup object that parses the HTML document
soup = BeautifulSoup('<p>hello</p>','lxml')

# Print the text inside the first <p> tag
print(soup.p.string)

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')
# Use prettify() to print the parsed document with consistent indentation
print(soup.prettify())

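Node selectors:

Access a node directly by its tag name, for example soup.title.
If the document contains several tags with that name, only the first one is returned.
Type:
<class 'bs4.element.Tag'>
Attributes:
name: the name of the tag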

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html,'lxml')
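# Select the title tag in the HTML document
# print(soup.title)
# Print the text inside the tag
# print(soup.title.string)
# print(soup.head)

# With several <p> tags in the document, only the first one is returned
# print(soup.p)

# Check the data type
# print(type(soup.title))
# Use the name attribute to get the node's tag name
# print(soup.head.name)
# Get attributes such as id and class
# attrs returns all attributes of the tag as a dict
# print(soup.p.attrs)
# print(soup.p.attrs['name'])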

print(soup.p['name'])
print(soup.p['class'])
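
The two lookups above return different types, which is easy to miss. The sketch below uses a made-up one-line document and html.parser (an assumption, chosen to avoid extra dependencies) to show that single-valued attributes come back as strings while class comes back as a list.

```python
from bs4 import BeautifulSoup

# Illustrative markup with a two-valued class attribute
soup = BeautifulSoup(
    '<p class="title big" name="dromouse"><b>Story</b></p>', 'html.parser'
)
p = soup.p

name_value = p['name']     # a plain string
class_value = p['class']   # a list, because class may hold several values
missing = p.get('id')      # get() returns None instead of raising KeyError

print(name_value, class_value, missing)
```

Using get() is the safer choice when an attribute may be absent, since p['id'] would raise KeyError here.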

Nested selection:

Each selection returns a <class 'bs4.element.Tag'>, so selections can be chained.

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""

soup = BeautifulSoup(html,'lxml')

# Print the title tag inside the head of the HTML document
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

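Relational selection:

Sometimes a node cannot be reached in a single step. Select a nearby node first, then move from it to its parent, children, or siblings.

Child nodes:

soup.p.children

Descendant nodes:

soup.p.descendants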

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

soup = BeautifulSoup(html,'lxml')
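# contents returns a list of all direct children of the p node, including text nodes
# print(soup.p.contents)

# children yields the same direct children as an iterator
# print(soup.p.children)
# for i,child in enumerate(soup.p.children):
#     # print(i)
#     print(child)
# descendants is a generator over every node below p, at any depth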
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
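
To make the difference between children and descendants concrete, here is a small sketch on a compact, made-up paragraph. It uses html.parser as an assumption to keep the example dependency-free.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup: text, then <a> containing <span>, then text
soup = BeautifulSoup('<p>one<a><span>Elsie</span></a>two</p>', 'html.parser')

children = list(soup.p.children)        # direct children only
descendants = list(soup.p.descendants)  # every node at any depth

# Keep only Tag objects to filter out the text nodes
child_tags = [c.name for c in children if isinstance(c, Tag)]
descendant_tags = [d.name for d in descendants if isinstance(d, Tag)]

print(len(children), child_tags)
print(len(descendants), descendant_tags)
```

The span appears only among the descendants because it is nested inside the a tag rather than directly under p.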

Parent node:

soup.a.parent

Ancestor nodes:

soup.a.parents

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

soup = BeautifulSoup(html,'lxml')
# Print the parent node
# print(soup.a.parent)
# Print the ancestor nodes (parents is a generator, so wrap it in list() to see the items)
print(soup.a.parents)
print(list(enumerate(soup.a.parents)))
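
Printing whole ancestor nodes produces a lot of output. A shorter way to see the chain is to print only the tag names, as in this sketch on a made-up document (html.parser is an assumption here).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p><a>x</a></p></body></html>', 'html.parser')

# Walk from the direct parent up to the root of the tree
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# parent is simply the first item that parents yields
direct_parent = soup.a.parent.name
```

The last entry is the BeautifulSoup object itself, whose name is the special value '[document]'.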

Sibling nodes:

Previous sibling:

soup.a.previous_sibling

Next sibling:

soup.a.next_sibling

All following siblings:

soup.a.next_siblings

All preceding siblings:

soup.a.previous_siblings

from bs4 import BeautifulSoup

html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            Hello
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
    </body>
"""

soup = BeautifulSoup(html,'lxml')
# Print the next sibling node
# print(soup.a.next_sibling)
# Print the previous sibling node
# print(soup.a.previous_sibling)
# Print all following sibling nodes
# print(list(enumerate(soup.a.next_siblings)))
# Print all preceding sibling nodes
print(list(enumerate(soup.a.previous_siblings)))

from bs4 import BeautifulSoup

html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        </p>
"""
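soup = BeautifulSoup(html,'lxml')
# The two <a> tags are adjacent, so next_sibling is the second <a> tag rather than a text node
print(soup.a.next_sibling.string)

# print(soup.a.parents)
# print(list(soup.a.parents)[0])
# The first ancestor is the direct parent <p>; print its class attribute
print(list(soup.a.parents)[0].attrs['class'])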
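
When there is text or whitespace between two tags, next_sibling returns that text node instead of the next tag. The sketch below, on made-up markup parsed with html.parser (an assumption), contrasts it with find_next_sibling(), which skips ahead to the next matching tag.

```python
from bs4 import BeautifulSoup

# Illustrative markup: text sits between the two links
soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>\n Hello \n<a id="link2">Lacie</a></p>',
    'html.parser',
)
first = soup.a

next_node = first.next_sibling           # the text node between the links
next_tag = first.find_next_sibling('a')  # the next <a> tag, skipping text

print(repr(next_node))
print(next_tag['id'])
```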

Method selectors:

find_all() returns every matching element.
find() returns only the first match.

from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

soup = BeautifulSoup(html,'lxml')

# Find all ul elements
# print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[1]))

# Iterate over every ul
for ul in soup.find_all(name='ul'):
    # Print the li elements under each ul
    # print(type(ul.find_all(name='li')))
    for li in ul.find_all(name='li'):
        print(li.string)

from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

soup = BeautifulSoup(html,'lxml')
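# Use find_all to find every element whose id attribute is 'list-2'
print(soup.find_all(attrs={'id':'list-2'}))

# class is a Python keyword, so the keyword argument is spelled class_
# print(soup.find_all(attrs={'class':'element'}))
print(soup.find_all(class_='element'))

from bs4 import BeautifulSoup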
import re

html = """
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
"""

soup = BeautifulSoup(html,'lxml')
# Use a regular expression to find all text nodes that contain 'link' (the result is a list of strings, not tags)
found_elements = soup.find_all(string=re.compile('link'))
print(found_elements)
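
Because the string filter on its own returns text nodes, getting back to the enclosing tags takes one more step. The sketch below shows two options on made-up markup: following .parent from each string, or passing a tag name together with string. html.parser is an assumption made to avoid extra dependencies.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup(
    '<div><a>Hello, this is a link</a>'
    '<a>Hello, this is a link, too</a><b>other</b></div>',
    'html.parser',
)

# string alone matches text nodes
found = soup.find_all(string=re.compile('link'))
all_strings = all(isinstance(s, NavigableString) for s in found)

# Option 1: climb from each text node to the tag that contains it
parent_names = [s.parent.name for s in found]

# Option 2: give a tag name too, which returns tags whose text matches
tags = soup.find_all('a', string=re.compile('too'))

print(parent_names, tags)
```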
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

soup = BeautifulSoup(html,'lxml')
# find returns only the first matching ul element
ul_tag = soup.find(name='ul')
print(ul_tag)

# list_element = soup.find(class_='list')
# print(list_element)
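
One practical difference between the two methods is what they return when nothing matches: find() gives None, while find_all() gives an empty list. A short sketch with made-up markup (html.parser is an assumption) shows a guard against the None case.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<ul id="list-1"><li class="element">Foo</li></ul>', 'html.parser'
)

first_li = soup.find('li', class_='element')  # a Tag
missing = soup.find('table')                  # None, nothing matched
all_tables = soup.find_all('table')           # an empty list

# Guard before using the result, since calling a method on None would fail
text = first_li.get_text() if first_li is not None else ''
print(text, missing, all_tables)
```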

CSS selectors:

Call the select() method and pass in a CSS selector.

from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

soup = BeautifulSoup(html,'lxml')

# print(soup.select('.panel .panel-heading'))

# print(soup.select('ul li'))

# print(soup.select('#list-2 .element'))

# for ul in soup.select('ul'):
#     # print(ul.select('li'))
#     # print(ul['id'])
#     print(ul.attrs['id'])

for li in soup.select('li'):
    # print(li.string)
    print(li.get_text())
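
select() always returns a list. When only the first match is wanted, select_one() is the counterpart. The sketch below uses a shortened, made-up version of the list markup and html.parser (an assumption) to show both, along with a child combinator.

```python
from bs4 import BeautifulSoup

html = (
    '<ul class="list" id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul class="list list-small" id="list-2">'
    '<li class="element">Baz</li></ul>'
)
soup = BeautifulSoup(html, 'html.parser')

# Direct li children of the element with id list-1
texts = [li.get_text() for li in soup.select('#list-1 > li.element')]

# First li inside a ul that carries the list-small class
first_small = soup.select_one('ul.list-small li')

# Attribute access works on selected tags just as on any other Tag
ids = [ul['id'] for ul in soup.select('ul.list')]

print(texts, first_small.get_text(), ids)
```

Like find(), select_one() returns None when nothing matches.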
