h5py Quick Start

zhishidi

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/zhishidi/article/details/146089363

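h5py is a Python library for reading and writing files in the HDF5 format. HDF5 (Hierarchical Data Format version 5) is a file format for efficiently storing and managing large-scale scientific data. It supports hierarchical structures of groups and datasets, metadata, and compression, which makes it well suited to multidimensional arrays such as images, numerical simulation results, and machine learning model weights.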

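Common uses

Storing and reading large datasets (such as NumPy arrays).
Saving deep learning model parameters (for example, TensorFlow/Keras weights in the legacy .h5 format, or PyTorch tensors converted to NumPy arrays).
Persisting scientific data (such as physics experiment data or astronomical observations).
Efficiently handling data too large to fit in memory (through chunked reads and writes and compression).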

Common use cases and code examples

1. Create an HDF5 file and write data

import h5py
import numpy as np

# Create an HDF5 file (the with block closes it automatically)
with h5py.File("data.h5", "w") as f:
    # Create a 3x3 dataset named "dataset1"
    data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    dset = f.create_dataset("dataset1", data=data)

    # Attach attributes (metadata) to the dataset
    dset.attrs["description"] = "Example dataset"
    dset.attrs["author"] = "John Doe"

    # Create a group and write a dataset inside it
    group = f.create_group("group1")
    group.create_dataset("dataset2", data=np.random.rand(5))

2. Read an HDF5 file

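import h5py

# Assumes data.h5 was created by the code in section 1
with h5py.File("data.h5", "r") as f:
    # Open the dataset; nothing is loaded into memory yet
    dset = f["dataset1"]
    print(dset[:])  # [:] reads the whole dataset into a NumPy array

    # Read metadata
    print(dset.attrs["description"])  # prints "Example dataset"

    # Walk the file structure; the callback runs for every group and dataset
    def print_objects(name, obj):
        print(f"Name: {name}, Type: {type(obj)}")
    f.visititems(print_objects)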
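
Printing every object is handy for a quick look, but a summary of just the datasets is often more useful. The following is a minimal sketch, not part of the original article: the helper name summarize and the file name summary_demo.h5 are made up for illustration, and the demo file only mimics the layout from section 1.

```python
import os
import tempfile

import h5py
import numpy as np


def summarize(path):
    """Return {dataset path: (shape, dtype string)} for every dataset in the file."""
    summary = {}

    def visit(name, obj):
        # visititems calls this for every group and dataset below the root;
        # groups are skipped, datasets are recorded.
        if isinstance(obj, h5py.Dataset):
            summary[name] = (obj.shape, str(obj.dtype))
        # Returning None keeps the traversal going.

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return summary


# Build a small throwaway file (hypothetical name) with the layout from section 1
demo_path = os.path.join(tempfile.mkdtemp(), "summary_demo.h5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("dataset1", data=np.arange(9).reshape(3, 3))
    f.create_group("group1").create_dataset("dataset2", data=np.random.rand(5))

info = summarize(demo_path)
for name, (shape, dtype) in info.items():
    print(name, shape, dtype)
```

Because only shape and dtype are read, the summary does not load any array data, so it stays cheap even on large files.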

3. Store and load a machine learning dataset

import h5py
import numpy as np

# Store an image dataset
with h5py.File("images.h5", "w") as f:
    images = np.random.rand(1000, 64, 64, 3)  # stand-in for 1000 RGB images of 64x64
    labels = np.random.randint(0, 10, 1000)
    f.create_dataset("images", data=images, compression="gzip")
    f.create_dataset("labels", data=labels)

# Load only part of the data on demand (avoids exhausting memory)
with h5py.File("images.h5", "r") as f:
    batch_images = f["images"][0:100]  # reads only the first 100 images
    batch_labels = f["labels"][0:100]
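
The same slicing idea extends naturally to iterating over a whole file in batches. The sketch below is one possible way to do it, not the article's own code: the generator name iter_batches, the file name batches_demo.h5, and the small array sizes are assumptions chosen to keep the example quick to run.

```python
import os
import tempfile

import h5py
import numpy as np


def iter_batches(path, batch_size):
    """Yield (images, labels) NumPy batches, reading one slice at a time."""
    with h5py.File(path, "r") as f:
        images, labels = f["images"], f["labels"]
        n = images.shape[0]
        for start in range(0, n, batch_size):
            stop = min(start + batch_size, n)
            # Slicing a dataset reads only that range from disk
            yield images[start:stop], labels[start:stop]


# Small stand-in file (hypothetical name and sizes)
batch_path = os.path.join(tempfile.mkdtemp(), "batches_demo.h5")
with h5py.File(batch_path, "w") as f:
    f.create_dataset("images", data=np.random.rand(250, 8, 8, 3), compression="gzip")
    f.create_dataset("labels", data=np.random.randint(0, 10, 250))

batch_shapes = [(x.shape, y.shape) for x, y in iter_batches(batch_path, 100)]
print(batch_shapes)
```

The last batch is smaller than batch_size when the dataset length is not a multiple of it, so downstream code should not assume a fixed batch shape.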

4. Handle large arrays (chunked storage)

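import h5py
import numpy as np

# Note: this dataset holds about 4 GB of float32 data before compression
with h5py.File("big_data.h5", "w") as f:
    # Chunked storage: the array is stored in independent blocks of 1000x1000 values
    dset = f.create_dataset("big_array", shape=(1000000, 1000), dtype='float32',
                            chunks=(1000, 1000), compression="lzf")
    # Write one chunk-sized block of rows at a time, so only one block is in memory
    for i in range(1000):
        dset[i*1000:(i+1)*1000] = np.random.rand(1000, 1000)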
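
Reading can follow the same block-by-block pattern as writing. The sketch below computes per-column means without ever holding the full array in memory. It is an illustration under assumptions: the function name column_means, the file name chunked_demo.h5, and the deliberately small 400x16 shape are not from the original article.

```python
import os
import tempfile

import h5py
import numpy as np


def column_means(path, name):
    """Compute per-column means by streaming over the dataset in row blocks."""
    with h5py.File(path, "r") as f:
        dset = f[name]
        # dset.chunks is the chunk shape, or None for contiguous storage
        rows_per_block = dset.chunks[0] if dset.chunks else 1024
        total = np.zeros(dset.shape[1], dtype=np.float64)
        for start in range(0, dset.shape[0], rows_per_block):
            stop = min(start + rows_per_block, dset.shape[0])
            block = dset[start:stop]  # only this block is read into memory
            total += block.sum(axis=0, dtype=np.float64)
        return total / dset.shape[0]


# Small stand-in for the large array above (hypothetical name and shape)
chunk_path = os.path.join(tempfile.mkdtemp(), "chunked_demo.h5")
reference = np.arange(400 * 16, dtype="float32").reshape(400, 16)
with h5py.File(chunk_path, "w") as f:
    dset = f.create_dataset("big_array", shape=(400, 16), dtype="float32",
                            chunks=(100, 16), compression="lzf")
    for i in range(4):
        dset[i * 100:(i + 1) * 100] = reference[i * 100:(i + 1) * 100]

means = column_means(chunk_path, "big_array")
print(means[:3])
```

Aligning the read blocks with the chunk shape means each chunk is decompressed once per pass, which is usually the cheapest access pattern for compressed data.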

5. Organize complex data structures

import h5py
import numpy as np

with h5py.File("experiment.h5", "w") as f:
    # Nest datasets and subgroups inside a group
    group = f.create_group("experiment1")
    group.create_dataset("temperature", data=np.array([25.5, 26.0, 24.8]))
    subgroup = group.create_group("sensors")
    subgroup.create_dataset("sensor1", data=np.array([0.1, 0.2, 0.3]))
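
Nested objects can also be addressed with slash-separated paths, much like files in a directory tree. The following sketch shows path-based creation, lookup, and membership tests. The file name nested_demo.h5 and the variable names are chosen for illustration, and the non-existent sensor2 is only there to show a failed lookup.

```python
import os
import tempfile

import h5py
import numpy as np

nested_path = os.path.join(tempfile.mkdtemp(), "nested_demo.h5")

with h5py.File(nested_path, "w") as f:
    # Intermediate groups in a path are created automatically
    f.create_dataset("experiment1/sensors/sensor1", data=np.array([0.1, 0.2, 0.3]))
    f["experiment1"].create_dataset("temperature", data=np.array([25.5, 26.0, 24.8]))

with h5py.File(nested_path, "r") as f:
    # Read a nested dataset directly by its full path
    sensor1 = f["experiment1/sensors/sensor1"][:]
    # Membership tests accept paths too
    has_sensor2 = "experiment1/sensors/sensor2" in f
    # Groups behave like dictionaries: keys() lists the direct children
    children = sorted(f["experiment1"].keys())

print(sensor1, has_sensor2, children)
```

Checking with in before indexing avoids a KeyError when a file may or may not contain a given dataset.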