The next step is to preprocess our caption data and build a vocabulary, or metadata dictionary, for the captions. We start by reading in our training dataset records and writing a function to preprocess the text captions:
import pandas as pd

# load the tab-separated training records
train_df = pd.read_csv('image_train_dataset.tsv', delimiter='\t')
total_samples = train_df.shape[0]
total_samples
35000
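For orientation, the vocabulary mentioned above can be thought of as a mapping from each distinct word to an integer index. The following is only a minimal sketch of that idea, not the code used in this chapter: the helper `build_vocab`, the `toy_captions` list, and the choice to leave index 0 free for padding are all assumptions made for illustration.

```python
from collections import Counter

def build_vocab(captions):
    """Map each distinct word to an integer index, most frequent first."""
    counts = Counter(word for caption in captions for word in caption.split())
    # Sort by descending frequency, then alphabetically, so the result is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    # Assumption: index 0 is left unused here so it could serve as a padding value.
    return {word: idx for idx, word in enumerate(ordered, start=1)}

# Made-up captions for illustration; these are not taken from the dataset.
toy_captions = ['a dog runs', 'a dog sleeps']
vocab = build_vocab(toy_captions)
```

A mapping like this only works well if the captions are normalized first, since otherwise 'Dog', 'dog' and 'dog.' would each receive their own index.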
# function to pre-process text captions
def preprocess_captions(caption_list):
    pc = []
    for caption in caption_list:
        caption = caption.strip().lower()
        caption = caption.replace('.', '').replace(',', '').replace("'", "").replace('"', '')
        caption = caption.replace('&','and').replace...