TensorFlow Mid-Level API: Importing Data with TFRecord

jinjiajia95

License: CC 4.0 BY-SA

Category: tensorflow. Tags: TFRecord, tensorflow

Article link: https://blog.csdn.net/weixin_40746796/article/details/99441692

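This article discusses the advantages of the TFRecord format on large datasets, namely faster data processing and lower storage use. By comparison with the conventional way of feeding data, it walks through the TFRecord storage workflow, including how to handle different data types and how to build and read TFRecord files. It also gives a practical example of using TFRecord with a text classification model.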

Why use TFRecord?

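When a dataset is small, we load all of it into memory so it can be fed to the model quickly. Once the data no longer fits in memory, it has to stay on disk and be read a little at a time, and the speed of moving, reading, and processing the data starts to matter. TFRecord is used to speed this up and to save space.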

I. Data description:

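Suppose we want to train a text classification model. We collect text for each category in advance and use it as the basis for predicting the category, and we also record the true category of each text.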

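1. Conventional approach: use Python code to do the batching, shuffling, padding, and other processing on NumPy data, then feed the result into the graph through placeholder + feed_dict, where it becomes tensors.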

## feed_dict ##
input_x = [1, 3, 5, ..., 4]  # shape (1, seq_length)
input_y = [1, 0, 0, ..., 0]  # shape (1, num_classes)
## Placeholders of the classification model ##
self.input_x = tf.placeholder(tf.int32,
            [None, self.config.seq_length], name='input_x')
self.input_y = tf.placeholder(tf.float32,
            [None, self.config.num_classes], name='input_y')
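
As a rough illustration of the Python-side work this approach implies, here is a minimal sketch of shuffling, batching, and padding with plain lists. The helper names `pad_batch` and `batch_iter` and the toy data are hypothetical and not part of the original article.

```python
import random


def pad_batch(batch, pad_id=0):
    """Right-pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]


def batch_iter(xs, ys, batch_size, seed=0):
    """Yield shuffled (inputs, labels) batches; inputs are padded per batch."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield pad_batch([xs[i] for i in idx]), [ys[i] for i in idx]


# Hypothetical toy data: three token-id sequences with one-hot labels.
xs = [[1, 3, 5], [2, 4], [7]]
ys = [[1, 0], [0, 1], [1, 0]]
batches = list(batch_iter(xs, ys, batch_size=2))
```

Each `(x_batch, y_batch)` pair would then be passed to the graph through `feed_dict` for the two placeholders. All of this runs in Python for every batch, which is the cost that becomes noticeable on large datasets.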

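2. TFRecord storage: TFRecord writes one sample at a time as a dictionary. The keys of the dictionary do not have to be "input" and "label"; they can instead name individual features, and at read time you choose which features form the input and which form the label. The benefit is that you can later pick only the features you need, and tasks with several inputs and labels, such as multi-task learning, are easy to handle.
Note: in general a single feature is one dimension, that is, a scalar, and all the features together make up the representation. In TFRecord storage, however, the value of a feature in the dictionary does not have to be a scalar.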
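
The idea of keying a sample by feature and choosing inputs and labels later can be sketched with an ordinary dictionary. The feature names and the `split_sample` helper below are made up for illustration; a real pipeline would make the same choice when parsing the stored records.

```python
# One sample stored as a feature dictionary (hypothetical feature names).
sample = {
    "title_ids": [4, 8],
    "body_ids": [15, 16, 23],
    "topic": [1, 0, 0],
    "sentiment": [0, 1],
}


def split_sample(sample, input_keys, label_keys):
    """Pick which stored features act as inputs and which act as labels."""
    inputs = {key: sample[key] for key in input_keys}
    labels = {key: sample[key] for key in label_keys}
    return inputs, labels


# Single-task use: only the body text and the topic label are needed.
single_inputs, single_labels = split_sample(sample, ["body_ids"], ["topic"])

# Multi-task use: two inputs and two labels from the same stored sample.
multi_inputs, multi_labels = split_sample(
    sample, ["title_ids", "body_ids"], ["topic", "sentiment"])
```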

2.1 Open the TFRecord file

writer = tf.python_io.TFRecordWriter('%s.tfrecord' %'test')

2.2 Create the dictionary for a sample

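Here we write the samples into the TFRecord file one at a time.
First store the names and values of all of a sample's features in a dictionary, with the feature name as the key and the feature value as the value.
Each feature value must be converted to one of the feature types that TensorFlow defines: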

2.2.1 Storage types

int64: tf.train.Feature(int64_list=tf.train.Int64List(value=input))
float32: tf.train.Feature(float_list=tf.train.FloatList(value=input))
string: tf.train.Feature(bytes_list=tf.train.BytesList(value=input))
Note: input must be a list (a vector).
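
One way to avoid repeating these three constructors is a small dispatcher that picks the list type from the first element. This is only a sketch: `to_feature` is a hypothetical helper, and it assumes a non-empty list whose elements all share one type.

```python
import numbers

import tensorflow as tf


def to_feature(values):
    """Wrap a non-empty list of ints, floats, bytes, or str in a tf.train.Feature."""
    values = list(values)
    if not values:
        raise ValueError("cannot infer a feature type from an empty list")
    first = values[0]
    if isinstance(first, (bytes, str)):
        # BytesList stores raw bytes, so text is encoded first.
        encoded = [v.encode("utf-8") if isinstance(v, str) else v for v in values]
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=encoded))
    if isinstance(first, numbers.Integral):
        return tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(v) for v in values]))
    # Everything else is treated as a float (stored as float32).
    return tf.train.Feature(
        float_list=tf.train.FloatList(value=[float(v) for v in values]))


int_feature = to_feature([1, 2, 3])
float_feature = to_feature([0.5, 1.5])
text_feature = to_feature(["cat"])
```

A scalar still has to be wrapped first, for example `to_feature([label_id])`, since every storage type takes a list.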

2.2.2 Handling features that are tensors

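The TensorFlow feature types only accept list data, so how should a matrix or a higher-order tensor be handled?

There are two ways:

Convert to a list: flatten the tensor into a list (that is, a vector), then write it the same way as any other list.
Convert to a string: convert the tensor to bytes with .tostring() (named .tobytes() in newer NumPy), then store it with tf.train.Feature(bytes_list=tf.train.BytesList(value=[input.tostring()])).
Shape information: either way the data loses its shape, so when writing the feature into the sample you should add the shape as an extra feature. The shape is of int type; here I name the shape feature as the original feature name + '_shape'.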
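
Both conversions and the role of the stored shape can be checked with NumPy alone. The sketch below uses a made-up 2x3 matrix and `.tobytes()`, the current NumPy name for `.tostring()`; it shows only the array side, not the TFRecord calls.

```python
import numpy as np

matrix = np.arange(6, dtype=np.float32).reshape(2, 3)  # made-up sample data
shape = list(matrix.shape)  # this is what the extra '_shape' feature keeps

# Way 1: flatten to a list of floats, then restore with the stored shape.
flat = matrix.reshape(-1).tolist()
restored_from_list = np.array(flat, dtype=np.float32).reshape(shape)

# Way 2: dump the raw bytes, then restore with the stored shape.
# The bytes carry neither shape nor dtype, so the reader must know both.
raw = matrix.tobytes()
restored_from_bytes = np.frombuffer(raw, dtype=np.float32).reshape(shape)
```

Note that the byte route also depends on the reader using the same dtype as the writer; reading float32 bytes as float64 would silently give wrong values or fail on reshape.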

# We write 3 samples here; each sample has 4 features: a scalar, a vector, a matrix, and a tensor.
# scalars, vectors, matrices, and tensors are assumed to be existing sequences of NumPy data, one entry per sample.
for i in range(3):
    # Create the dictionary
    features = {}
    # Write the scalar as Int64; because it is a scalar, "value=[scalars[i]]" wraps it in a list
    features['scalar'] = tf.train.Feature(int64_list=tf.train.Int64List(value=[scalars[i]]))

    # Write the vector as float; it is already a list, so "value=vectors[i]" has no brackets
    features['vector'] = tf.train.Feature(float_list=tf.train.FloatList(value=vectors[i]))

    # Write the matrix as float; one way to store a matrix is to flatten it into a list
    features['matrix'] = tf.train.Feature(float_list=tf.train.FloatList(value=matrices[i].reshape(-1)))
    # The matrix shape (2,3) is lost, so store it too; the original shape can be restored later
    features['matrix_shape'] = tf.train.Feature(int64_list=tf.train.Int64List(value=matrices[i].shape))

    # Write the 3-D tensor as float; the other way is to store it as bytes and convert it back later
    features['tensor'] = tf.train.Feature(bytes_list=tf.train.BytesList(value=[tensors[i].tostring()]))
    # Store the lost shape information (806,806,3)
    features['tensor_shape'] = tf.train.Feature(int64_list=tf.train.Int64List(value=tensors[i].shape))

2.3 Convert to tf_features

# Pass the dictionary holding all the features to tf.train.Features
    tf_features = tf.train.Features(feature= features)

2.4 Convert to tf_example

# Then turn it into one sample (an Example)
    tf_example = tf.train.Example(features = tf_features)

2.5 Serialize the sample

# Serialize the sample
    tf_serialized = tf_example.SerializeToString()

2.6 Write the sample

# Write one serialized sample
    writer.write(tf_serialized)
    # The loop above runs 3 times, so at this point 3 samples have been written

2.7 Close the TFRecord file

# Close the file
writer.close()
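
To see what steps 2.2 to 2.5 actually produce, a serialized sample can be parsed back in memory with the protobuf method `FromString` and the stored shape used to rebuild the matrix. This is a minimal sketch with made-up data; it skips the file entirely and is meant as a sanity check, not as the reading pipeline (files are normally read through `tf.data.TFRecordDataset`).

```python
import numpy as np
import tensorflow as tf

matrix = np.arange(6, dtype=np.float32).reshape(2, 3)  # made-up sample data

# Steps 2.2 to 2.5: feature dictionary -> Features -> Example -> bytes.
features = {
    "scalar": tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
    "matrix": tf.train.Feature(
        float_list=tf.train.FloatList(value=matrix.reshape(-1).tolist())),
    "matrix_shape": tf.train.Feature(
        int64_list=tf.train.Int64List(value=list(matrix.shape))),
}
example = tf.train.Example(features=tf.train.Features(feature=features))
serialized = example.SerializeToString()

# Reverse direction: bytes -> Example -> values, using the stored shape.
parsed = tf.train.Example.FromString(serialized)
stored = parsed.features.feature
scalar = stored["scalar"].int64_list.value[0]
shape = list(stored["matrix_shape"].int64_list.value)
restored = np.array(stored["matrix"].float_list.value, dtype=np.float32).reshape(shape)
```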

Complete program for importing data with TFRecord for a text classification model:

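# Import libraries
import os
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow.contrib.keras as kr

# Settings used by the functions below. The original listing uses these names
# without defining them: the validation and shard counts are the examples given
# in main()'s docstring, and max_sentence_length is an assumed value.
max_sentence_length = 600
num_validation_sentences = 20000
train_output_shards = 100
validation_output_shards = 10

# Generate the data
'''
Assume the text has already been tokenized, giving
sentences = [[C1, C2, C3, ..., CN], [], ...]  # shape (batch_size, N)
Assume label = kr.utils.to_categorical(labs)
label = [[1, 0, 0, ..., 0], [], ...]  # shape (batch_size, num_classes)
vocab is the vocabulary dictionary of the text, vocab = {'<UNK>': 0, C1: 1, C2: 2, ...}, of size vocab_size
'''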
## Storage type int64 ##
def _int64_feature(value):
  """Helper for creating an Int64 Feature."""
  return tf.train.Feature(int64_list=tf.train.Int64List(
      value=[int(v) for v in value]))
## Storage type float32 ##
def _float32_feature(value):
  """Helper for creating a float32 Feature."""
  return tf.train.Feature(float_list=tf.train.FloatList(
      value=[float(v) for v in value]))
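## Convert a sentence to a sequence of ids ##
def _sentence_to_ids(sentence, vocab):
  """Helper for converting a sentence (list of words) to a padded list of ids."""
  x = [vocab.get(w, 0) for w in sentence]  # words not in vocab map to 0 (<UNK>)
  # Pad or truncate at the end so every sequence has length max_sentence_length
  x = kr.preprocessing.sequence.pad_sequences([x], max_sentence_length, padding='post', truncating='post')
  return x[0]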
  
def _create_serialized_example(sentence, label, vocab):
  """Helper for creating a serialized Example proto.

  Passes the dictionary holding all the features to tf.train.Features, then
  wraps it in one sample (an Example)."""
  example = tf.train.Example(features=tf.train.Features(feature={
      "input_x": _int64_feature(_sentence_to_ids(sentence, vocab)),
      "input_y": _float32_feature(label),
  }))

  return example.SerializeToString()  # serialize the sample

def _process_input_file(sentences, label, vocab):
  """Processes the sentences in an input file.

  Args:
    sentences: pre-tokenized list
    label: one-hot label
    vocab: A dictionary of word to id.
  Returns:
    processed: A list of serialized Example protos
  """
  processed = []
  for i in range(len(sentences)):
    # Truncate sentences that reach the maximum length
    if max_sentence_length and len(sentences[i]) >= max_sentence_length:
      sentence = sentences[i][:max_sentence_length]
    else:
      sentence = sentences[i]
    serialized = _create_serialized_example(sentence, label[i], vocab)
    processed.append(serialized)
  return processed

def _write_shard(filename, dataset, indices):
  """Writes a TFRecord shard."""
  with tf.python_io.TFRecordWriter(filename) as writer:
    for j in indices:
      writer.write(dataset[j])


def _write_dataset(name, dataset, indices, num_shards, output_dir):
  """Writes a sharded TFRecord dataset.

  Args:
    name: Prefix of the shard file names, "train" or "validation".
    dataset: List of serialized Example protos.
    indices: List of indices of 'dataset' to be written.
    num_shards: The number of output shards.
    output_dir: Directory the shard files are written to.
  """
  # Evenly spaced borders split the indices into num_shards contiguous slices
  borders = np.int32(np.linspace(0, len(indices), num_shards + 1))
  for i in range(num_shards):
    filename = os.path.join(output_dir, "%s-%.5d-of-%.5d" % (name, i, num_shards))
    shard_indices = indices[borders[i]:borders[i + 1]]
    _write_shard(filename, dataset, shard_indices)

def main(sentences, label, vocab, output_dir):
  '''
  num_validation_sentences: number of validation samples, e.g. 20000
  train_output_shards: number of files the training data is split into, e.g. 100
  validation_output_shards: number of files the validation data is split into, e.g. 10
  '''
  dataset = []
  dataset.extend(_process_input_file(sentences, label, vocab))
  # Shuffle with a fixed seed so the train/validation split is reproducible
  np.random.seed(123)
  shuffled_indices = np.random.permutation(len(dataset))
  val_indices = shuffled_indices[:num_validation_sentences]
  train_indices = shuffled_indices[num_validation_sentences:]
  _write_dataset("train", dataset, train_indices, train_output_shards, output_dir)
  _write_dataset("validation", dataset, val_indices,
                 validation_output_shards, output_dir)
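
The sharding step in _write_dataset can be looked at in isolation. The sketch below reproduces only the border computation and the file naming, without writing any files; `shard_slices` and `shard_name` are hypothetical helper names, and the sizes used are arbitrary.

```python
import numpy as np


def shard_slices(num_items, num_shards):
    """Return (start, end) pairs that split num_items into contiguous shards."""
    # num_shards + 1 evenly spaced borders, truncated to integers.
    borders = np.linspace(0, num_items, num_shards + 1).astype(np.int32)
    return [(int(borders[i]), int(borders[i + 1])) for i in range(num_shards)]


def shard_name(name, index, num_shards):
    """Build a shard file name in the same pattern as _write_dataset."""
    return "%s-%.5d-of-%.5d" % (name, index, num_shards)


slices = shard_slices(10, 3)
names = [shard_name("train", i, 3) for i in range(3)]
```

Because the borders are truncated, the shards are roughly equal in size rather than exactly equal, and together they cover every index once.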