Regular Expressions

Latest recommended article published on 2025-06-25 17:29:08

GaGa

License: CC 4.0 BY-SA

Tags: regular expressions, web crawlers, PyCharm

Article link: https://blog.csdn.net/qq_55909023/article/details/134640049

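1. Regular Expression Basics

A regular expression (often shortened to regex or RegExp) is a powerful and flexible tool for matching text. It is built from ordinary characters and operators that together define a rule describing the pattern a string must follow. That rule can then be used to search, replace, and validate strings.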
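The three uses named above (search, replace, validate) can be sketched with Python's standard `re` module. This is only a minimal illustration: the sample string `log_line` and the 11-digit phone rule are assumptions made up for the example, not something the article prescribes.

```python
import re

# Hypothetical sample data, invented for this illustration.
log_line = "user=alice phone=13800001111"

# Search: find the first substring that fits the pattern.
found = re.search(r'phone=\d+', log_line)
print(found.group())  # phone=13800001111

# Replace: substitute every run of digits with a placeholder.
masked = re.sub(r'\d+', '<hidden>', log_line)
print(masked)  # user=alice phone=<hidden>

# Validate: fullmatch() succeeds only if the whole string fits the pattern.
is_valid = re.fullmatch(r'\d{11}', "13800001111") is not None
is_invalid = re.fullmatch(r'\d{11}', "1380000") is not None
print(is_valid, is_invalid)  # True False
```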

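Common predefined character classes (shorthand escape sequences) and their meanings:

Escape	Meaning
\W	Matches any character that is not a digit, letter, underscore, or Chinese character
\w	Matches a digit, letter, underscore, or Chinese character (a "word" character)
\S	Matches any non-whitespace character
\s	Matches any whitespace character
\D	Matches any non-digit character
\d	Matches a digit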
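The table lists Chinese characters under `\w`. In Python 3 this holds because `str` patterns are Unicode-aware by default; passing the `re.ASCII` flag narrows `\w` and `\d` to ASCII only. A small sketch of the difference follows (the sample text is made up for illustration):

```python
import re

text = "user_01 你好 42"

# Default (Unicode) mode: Chinese characters count as word characters.
unicode_words = re.findall(r'\w+', text)
print(unicode_words)  # ['user_01', '你好', '42']

# re.ASCII restricts \w to [a-zA-Z0-9_], so the Chinese text is skipped.
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)
print(ascii_words)  # ['user_01', '42']

# The same applies to \d: full-width digits (U+FF14, U+FF12) match only in Unicode mode.
digits_text = "42 \uff14\uff12"
unicode_digits = re.findall(r'\d+', digits_text)
ascii_digits = re.findall(r'\d+', digits_text, flags=re.ASCII)
print(len(unicode_digits), len(ascii_digits))  # 2 1
```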

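Metacharacters are characters that carry a special meaning inside a regular expression. The common ones are:

Metacharacter	Meaning
.	Matches any single character except the newline \n
^	Matches the start of the string
$	Matches the end of the string
*	Matches the preceding element any number of times (including 0)
?	Matches the preceding element 0 or 1 times
\	Escape character: the metacharacter that follows loses its special meaning and matches itself
( )	The expression inside the parentheses forms a group; the text a group matches can be extracted
[ ]	Character set: matches any single character listed in the set
|	Logical OR between two alternatives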

The following code shows each of these in action:

import re

# Sample string
sample_string = "Hello World! 123"

# \W: matches any non-word character
pattern = re.compile(r'\W')
result = pattern.findall(sample_string)
print(result)  # Output: [' ', '!', ' ']

# \w: matches any word character
pattern = re.compile(r'\w+')
result = pattern.findall(sample_string)
print(result)  # Output: ['Hello', 'World', '123']

# \S: matches any non-whitespace character
pattern = re.compile(r'\S+')
result = pattern.findall(sample_string)
print(result)  # Output: ['Hello', 'World!', '123']

# \s: matches any whitespace character
pattern = re.compile(r'\s')
result = pattern.findall(sample_string)
print(result)  # Output: [' ', ' ']

# \D: matches any non-digit character
pattern = re.compile(r'\D+')
result = pattern.findall(sample_string)
print(result)  # Output: ['Hello World! ']

# \d: matches any digit
pattern = re.compile(r'\d+')
result = pattern.findall(sample_string)
print(result)  # Output: ['123']

# ^: matches the start of the string
pattern = re.compile(r'^Hello')
result = pattern.match(sample_string)
print(result.group() if result else "No match")  # Output: Hello

# *: matches the preceding element zero or more times
pattern = re.compile(r'lo*')
result = pattern.findall(sample_string)
print(result)  # Output: ['l', 'lo', 'l']

# $: matches the end of the string
# (r'World$' would print "No match" here, because the string ends with "123")
pattern = re.compile(r'123$')
result = pattern.search(sample_string)
print(result.group() if result else "No match")  # Output: 123

# .: matches any character except the newline \n
pattern = re.compile(r'W..ld')
result = pattern.search(sample_string)
print(result.group() if result else "No match")  # Output: World

# ?: matches the preceding element zero or one time
# Every position without a digit yields an empty match, hence the empty strings.
pattern = re.compile(r'\d?')
result = pattern.findall(sample_string)
print(result)  # Output: ['', '', '', '', '', '', '', '', '', '', '', '', '', '1', '2', '3', '']

# []: defines a character set and matches any one character in it
pattern = re.compile(r'[aeiou]')
result = pattern.findall(sample_string)
print(result)  # Output: ['e', 'o', 'o']

# (): creates a capturing group for extracting the matched substring
pattern = re.compile(r'(\d+)-(\w+)')
result = pattern.match("123-abc")
print(result.groups() if result else "No match")  # Output: ('123', 'abc')

# |: logical OR, matches either of two patterns
pattern = re.compile(r'cat|dog')
result = pattern.search("I love cats and dogs")
print(result.group() if result else "No match")  # Output: cat (search() returns the first match only)
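The escape character `\` from the metacharacter table is not exercised in the code above, so here is a short sketch of it, together with the newline behaviour of `.`. The sample strings are invented for illustration; `re.escape()` and `re.DOTALL` are part of the standard `re` module.

```python
import re

price_text = "Total: 3.50 USD (approx.)"

# Unescaped, "." matches any character, so the pattern 3.50 also matches "3x50".
loose = re.search(r'3.50', "3x50") is not None
print(loose)  # True

# Escaped, "\." matches only a literal dot.
strict = re.search(r'3\.50', "3x50") is not None
print(strict)  # False
print(re.search(r'3\.50', price_text).group())  # 3.50

# re.escape() escapes every metacharacter in a literal string.
literal = "(approx.)"
escaped_hit = re.search(re.escape(literal), price_text).group()
print(escaped_hit)  # (approx.)

# "." does not match "\n" by default; the re.DOTALL flag changes that.
default_dot = re.findall(r'a.b', "a\nb")
dotall_dot = re.findall(r'a.b', "a\nb", flags=re.DOTALL)
print(default_dot, dotall_dot)  # [] ['a\nb']
```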

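2. Extracting Data with Regular Expressions

1. findall():

findall() finds every non-overlapping substring that matches the regular expression and returns the matches as a list.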

import re

# Sample string
text = "Hello 123, Bye 456, Greetings 789"

# Match runs of digits
pattern = r'\d+'
matches = re.findall(pattern, text)
print(matches)  # Output: ['123', '456', '789']
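One detail worth knowing when using findall() for extraction: if the pattern contains capturing groups, the list holds the groups' text rather than the whole match. A minimal sketch, reusing the sample string from above:

```python
import re

text = "Hello 123, Bye 456, Greetings 789"

# No group: each item is the whole match.
whole = re.findall(r'\w+ \d+', text)
print(whole)  # ['Hello 123', 'Bye 456', 'Greetings 789']

# One group: each item is the text captured by that group.
numbers = re.findall(r'\w+ (\d+)', text)
print(numbers)  # ['123', '456', '789']

# Several groups: each item is a tuple with one entry per group.
pairs = re.findall(r'(\w+) (\d+)', text)
print(pairs)  # [('Hello', '123'), ('Bye', '456'), ('Greetings', '789')]
```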

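2. search():

search() scans the string for the first substring that matches and returns a match object, or None if nothing matches. Once you have a match, call group() to get the matched text.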

import re

# Sample string
text = "Hello World"

# Check whether "World" appears in the string
pattern = r'World'
match = re.search(pattern, text)
if match:
    print("Found:", match.group())  # Output: Found: World
else:
    print("Not found")
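Besides group(), the match object returned by search() also exposes the captured groups and the position of the match. The sketch below uses a made-up sample string and the `re.IGNORECASE` flag to show this, and to show that only the first match is returned:

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Order 66 shipped, order 77 pending"

match = re.search(r'order (\d+)', text, flags=re.IGNORECASE)
if match is not None:
    print(match.group())   # Order 66  (whole match; the later "order 77" is ignored)
    print(match.group(1))  # 66        (first capturing group)
    print(match.span())    # (0, 8)    (start and end index of the match)

# When nothing matches, search() returns None, so check before calling group().
missing = re.search(r'refund', text)
print(missing)  # None
```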

3. match():

match() tries to match the pattern only at the beginning of the string and returns a match object, or None if the string does not start with a match.

import re

# Sample string
text = "Hello World"

# Match "Hello" at the start of the string
pattern = r'Hello'
match = re.match(pattern, text)
if match:
    print("Matched:", match.group())  # Output: Matched: Hello
else:
    print("Not matched")
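Because match() is anchored at the start, it behaves differently from search() on the same input. The following sketch contrasts the two, and adds `re.fullmatch()` (also in the standard `re` module) for the case where the entire string must fit the pattern:

```python
import re

text = "Hello World"

# match() only looks at the beginning, so "World" is not found.
at_start = re.match(r'World', text)
print(at_start)  # None

# search() scans the whole string and finds it.
anywhere = re.search(r'World', text)
print(anywhere.group())  # World

# match() accepts a matching prefix; fullmatch() requires the whole string to match.
prefix = re.match(r'Hello', text)
whole_fail = re.fullmatch(r'Hello', text)
whole_ok = re.fullmatch(r'Hello \w+', text)
print(prefix.group(), whole_fail, whole_ok.group())  # Hello None Hello World
```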

4. finditer():

finditer() finds every matching substring and returns an iterator in which each element is a match object.

import re

# Sample string
text = "Hello 123, Bye 456, Greetings 789"

# Iterate over all runs of digits
pattern = r'\d+'
matches_iter = re.finditer(pattern, text)
for match in matches_iter:
    print("Found:", match.group())
# Output:
# Found: 123
# Found: 456
# Found: 789
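Since finditer() yields full match objects, it is convenient when you need more than the matched text, for example named groups or positions. A minimal sketch; the group names `word` and `number` are chosen here purely for illustration:

```python
import re

text = "Hello 123, Bye 456, Greetings 789"

# Named groups (?P<name>...) make each captured part accessible by name.
pattern = re.compile(r'(?P<word>[A-Za-z]+) (?P<number>\d+)')

records = []
for m in pattern.finditer(text):
    # Collect the word, the number converted to int, and where the match starts.
    records.append((m.group('word'), int(m.group('number')), m.start()))

print(records)
# [('Hello', 123, 0), ('Bye', 456, 11), ('Greetings', 789, 20)]
```

Compared with findall(), this keeps the position information and avoids building the full result list up front, which can help on large inputs.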