Python: removing emoji, and all 4-byte UTF-8 characters, from a string

Article link: https://blog.csdn.net/TomorrowAndTuture/article/details/143745696

Background

I recently used load data local infile to import data into a MySQL table and ran into the following error, which made the import fail:

pymysql.err.OperationalError: (1300, "Invalid utf8 character string: 'xxx'")

Looking at the source file, I found that it contained some emoji, like this:

Hello 👋 this is a text with 😀 some emo🉑ji🈵s!

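My MySQL server's default collation is utf8_general_ci, which belongs to the utf8 character set (an alias of utf8mb3). That character set stores at most 3 bytes per character, so it cannot hold these 4-byte UTF-8 characters. Supporting them requires switching the character set to utf8mb4, for example with the utf8mb4_general_ci collation.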
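
To make the byte counts concrete, here is a minimal sketch that prints how many bytes a few characters occupy in UTF-8. The helper name utf8_length is made up for this illustration and is not part of any library.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    # An ASCII letter, a common CJK character, and an emoji from the sample text.
    for ch in ["A", "中", "👋"]:
        print(f"U+{ord(ch):04X} {ch}: {utf8_length(ch)} byte(s)")
```

Only the last character needs 4 bytes, and that is the kind of character the 3-byte utf8 character set rejects.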

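These special characters are of no use to me, so rather than let them block the import of the rest of the data, I decided to find a way to strip them out.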

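Removing emoji

Emoji code points are scattered across many ranges, so tracking the ranges down by hand is tedious.

If you only need to remove emoji, install the emoji library with pip install emoji and call its replace_emoji function.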

import emoji

if __name__ == '__main__':
    text = "Hello 👋 this is a text with 😀 some emo🉑ji🈵s!"
    print(text)
    clean_text = emoji.replace_emoji(text, '')
    print(clean_text)

Hello 👋 this is a text with 😀 some emo🉑ji🈵s!
Hello  this is a text with  some emojis!

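You can also list which emoji a text contains. The result is deduplicated, and as the output below shows, its order does not necessarily match the order of appearance in the text.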

import emoji

if __name__ == '__main__':
    text = "Hello 👋 this is a text with 😀 some emo🉑ji🈵s!"
    print(text)
    emoji_list = emoji.distinct_emoji_list(text)
    print(emoji_list)

Hello 👋 this is a text with 😀 some emo🉑ji🈵s!
['👋', '😀', '🈵', '🉑']

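Removing 4-byte UTF-8 characters

Some 4-byte characters are not emoji at all, for example "𨑳", and they cannot be stored in a utf8_general_ci table either.

Most emoji are themselves 4-byte UTF-8 characters, so removing every 4-byte character removes them too. The approach is different, but it solves the same import problem. Note that a few emoji, such as ☺ (U+263A), sit inside the Basic Multilingual Plane and take only 3 bytes. This method leaves them in place, which is harmless here because the utf8 character set can store them.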

import re


def remove_utf8mb4_characters(text):
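    # This regex matches every character outside the Basic Multilingual Plane
    # (U+10000 to U+10FFFF), i.e. the characters that need 4 bytes in UTF-8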
    pattern = re.compile(r'[\U00010000-\U0010FFFF]')
    return pattern.sub('', text)


if __name__ == '__main__':
    text = "Hello 👋 this is a text with 😀 so𨑳me emo🉑ji🈵s!"
    print(text)
    clean_text = remove_utf8mb4_characters(text)
    print(clean_text)

Hello 👋 this is a text with 😀 so𨑳me emo🉑ji🈵s!
Hello  this is a text with  some emojis!
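
As a rough alternative to the regex, the same filter can be written by comparing code points directly, since a character needs 4 bytes in UTF-8 exactly when its code point is above U+FFFF. The sketch below also adds a check that reports whether a string contains any such character, which can be handy for deciding whether a file needs cleaning at all. The names needs_utf8mb4 and strip_non_bmp are hypothetical helpers invented for this example.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character lies outside the BMP (code point above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def strip_non_bmp(text: str) -> str:
    """Drop every character whose code point is above U+FFFF."""
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)


if __name__ == "__main__":
    # "☺" (U+263A) is a 3-byte emoji inside the BMP, so it is expected to survive.
    sample = "so𨑳me emo🉑ji ☺"
    print(needs_utf8mb4(sample))
    cleaned = strip_non_bmp(sample)
    print(cleaned)
    print(needs_utf8mb4(cleaned))
```

The regex version is likely the faster choice on large inputs. This variant is mainly meant to show what the character class is doing.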