Reading Large Text Files in Chunks with Python

Looooking

Last modified 2025-05-09 15:27:58

Licensed under CC 4.0 BY-SA

First published 2024-11-08 15:17:03

Article link: https://blog.csdn.net/TomorrowAndTuture/article/details/143626024

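Conventional reading

The conventional approach is to call readlines() on the file object returned by open(), which loads every line of the file into memory at once. That is fine for small files, but for large files the whole content has to fit in memory, so the memory cost grows with the file size.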

def read_content(filename):
    try:
        with open(filename, encoding='utf-8') as f:
            result = f.readlines()
            return result
    except Exception as e:
        print(f"Error reading file {filename}: {e}")
        return []
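
For comparison, a file object can also be iterated lazily, so that only the current line is held in memory. The following is a minimal sketch, assuming the per-line work does not need the whole file at once; `count_lines` is a hypothetical helper name and not part of the original code.

```python
def count_lines(filename):
    """Count lines by iterating the file object lazily (hypothetical helper)."""
    count = 0
    with open(filename, encoding='utf-8') as f:
        # The file object yields one line per iteration instead of
        # building a list of all lines the way readlines() does
        for _ in f:
            count += 1
    return count
```

The chunked approaches below build on the same idea, but hand back several lines at a time instead of one.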

Chunked reading

pandas

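The chunksize parameter of pandas' read_csv reads a file in chunks: instead of a single DataFrame, read_csv returns an iterator that yields DataFrames of at most chunksize rows each.

A CSV file is not fundamentally different from a plain-text .txt file, so the pandas CSV reader can read plain text as well. The trick is to pass a delimiter that never appears in the data (here '\001'), so that each line is parsed as a single column.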

import pandas as pd


def write_test(filename):
    lines = []
    for i in range(100):
        lines.append(f"hello,world{i + 1}\n")
    with open(filename, 'w', encoding='utf-8') as f:
        f.writelines(lines)


def read_content(filename, chunksize=10):
    try:
        result = pd.read_csv(filename, chunksize=chunksize, header=None, delimiter='\001')
        # result = pd.read_csv(filename, chunksize=10, header=None)
        return result
    except Exception as e:
        print(f"Error reading file {filename}: {e}")
        return []


if __name__ == '__main__':
    filename = "./test.txt"
    # write_test(filename)
    chunk_count = 0
    result = read_content(filename)
    for item in result:
        chunk_count += 1
        print(f"===chunk count {chunk_count}===")
        values = [list(value) for value in item.values]
        for value in values:
            print(value[-1])
        # print(item.values)
        print("===chunk data process end===")
        print()

===chunk count 1===
hello,world1
hello,world2
hello,world3
hello,world4
hello,world5
hello,world6
hello,world7
hello,world8
hello,world9
hello,world10
===chunk data process end===

===chunk count 2===
hello,world11
hello,world12
hello,world13
hello,world14
hello,world15
hello,world16
hello,world17
hello,world18
hello,world19
hello,world20
===chunk data process end===

...
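
Because each chunk is an ordinary DataFrame, a result can be accumulated across chunks without ever holding the whole file. The sketch below counts the lines that contain a keyword. It assumes a reasonably recent pandas release (the reader is used as a context manager) and that '\001' never occurs in the data; `count_matching_lines` is a hypothetical helper, and the `dtype`, `quoting` and `na` settings are choices made for this illustration, not taken from the original code.

```python
import csv

import pandas as pd


def count_matching_lines(filename, keyword, chunksize=10):
    """Count lines containing keyword, one chunk at a time (hypothetical helper)."""
    total = 0
    # sep='\001' is assumed not to occur in the data, so each line becomes one column;
    # QUOTE_NONE stops pandas from interpreting quote characters in the text
    with pd.read_csv(filename, chunksize=chunksize, header=None, sep='\001',
                     dtype=str, quoting=csv.QUOTE_NONE) as reader:
        for chunk in reader:
            # chunk has at most `chunksize` rows; column 0 holds the line text
            matches = chunk[0].str.contains(keyword, regex=False, na=False)
            total += int(matches.sum())
    return total
```

With the 100-line test file written by write_test, `count_matching_lines("./test.txt", "world1")` counts world1, world10 to world19 and world100.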

readline

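The file object returned by open() is a lazy line iterator, and its readline() method returns one line per call (an empty string at end of file). Calling readline() in a loop and collecting the lines into a list therefore gives chunked reading.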

def write_test(filename):
    lines = []
    for i in range(100):
        lines.append(f"hello,world{i + 1}\n")
    with open(filename, 'w', encoding='utf-8') as f:
        f.writelines(lines)


def read_content(filename, chunksize=10):
    try:
        with open(filename, encoding='utf-8') as f:
            result = []
            line = f.readline()
            while line:
                result.append(line)
                if len(result) == chunksize:
                    yield result
                    result = []
                line = f.readline()
            if result:
                yield result
    except Exception as e:
        print(f"Error reading file {filename}: {e}")
        # Inside a generator, a bare return simply ends the iteration
        return


if __name__ == '__main__':
    filename = "./test.txt"
    # write_test(filename)
    result = read_content(filename)
    for item in result:
        print(item)

['hello,world1\n', 'hello,world2\n', 'hello,world3\n', 'hello,world4\n', 'hello,world5\n', 'hello,world6\n', 'hello,world7\n', 'hello,world8\n', 'hello,world9\n', 'hello,world10\n']
['hello,world11\n', 'hello,world12\n', 'hello,world13\n', 'hello,world14\n', 'hello,world15\n', 'hello,world16\n', 'hello,world17\n', 'hello,world18\n', 'hello,world19\n', 'hello,world20\n']
['hello,world21\n', 'hello,world22\n', 'hello,world23\n', 'hello,world24\n', 'hello,world25\n', 'hello,world26\n', 'hello,world27\n', 'hello,world28\n', 'hello,world29\n', 'hello,world30\n']
...
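
The same chunking can be written more compactly with itertools.islice, which pulls a bounded number of items from any iterator. This is a sketch of an alternative and not the author's method; `read_in_chunks` is a hypothetical name.

```python
from itertools import islice


def read_in_chunks(filename, chunksize=10):
    """Yield lists of up to `chunksize` lines (hypothetical helper)."""
    with open(filename, encoding='utf-8') as f:
        while True:
            # islice takes at most `chunksize` lines from the file iterator
            chunk = list(islice(f, chunksize))
            if not chunk:
                # An empty list means the file is exhausted
                break
            yield chunk
```

As with the readline() version, the last chunk may hold fewer than chunksize lines, and the file is only opened once iteration starts.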

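Splitting a list evenly

Often we need to split the data in a list lst into n roughly equal parts, and then process the parts concurrently with multiple threads.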

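def split_list(lst: list, n: int) -> list:
    """Split lst into n lists (trailing lists may be shorter, or even empty, depending on the input)."""
    result = []
    length = len(lst)
    # If length is divisible by n, use step = length // n
    # Otherwise a step of length // n would leave elements over after n lists, so round the step up by 1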
    if length % n == 0:
        step = length // n
    else:
        step = length // n + 1
    for i in range(n):
        result.append(lst[i * step: (i + 1) * step])
    return result


if __name__ == '__main__':
    lst = [1, 2, 3, 4, 5, 6, 7]
    n = 2
    result = split_list(lst, n)
    print(result)
    # [[1, 2, 3, 4], [5, 6, 7]]
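
Rounding the step up can leave the trailing lists empty: for example, `split_list([1, 2, 3, 4, 5, 6, 7], 5)` returns `[[1, 2], [3, 4], [5, 6], [7], []]`. When the parts feed a thread pool, a more balanced split may be preferable. The sketch below spreads the remainder over the first parts with divmod and hands the parts to a ThreadPoolExecutor; `split_balanced` and `process_part` are hypothetical names, and `process_part` is only a placeholder for the real per-part work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_balanced(lst, n):
    """Split lst into n parts whose sizes differ by at most one (hypothetical helper)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    size, extra = divmod(len(lst), n)
    parts = []
    start = 0
    for i in range(n):
        # The first `extra` parts each take one additional element
        end = start + size + (1 if i < extra else 0)
        parts.append(lst[start:end])
        start = end
    return parts


def process_part(part):
    """Placeholder worker; stands in for the real per-part task."""
    return sum(part)


if __name__ == '__main__':
    parts = split_balanced([1, 2, 3, 4, 5, 6, 7], 3)
    print(parts)
    # [[1, 2, 3], [4, 5], [6, 7]]
    with ThreadPoolExecutor(max_workers=len(parts)) as executor:
        # map returns results in the same order as the input parts
        print(list(executor.map(process_part, parts)))
        # [6, 9, 13]
```

For the example in the original listing (7 elements, n = 2) both functions return the same result, [[1, 2, 3, 4], [5, 6, 7]].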