Author: 净好
Read this first:
Leaderboard A notes: Datawhale AI Summer Camp - NLP Practice: Text Classification and Keyword Extraction from Paper Abstracts Challenge - Notes on My First Run at Leaderboard A
I. Data Statistics
Before taking on Leaderboard B, were you as curious as I was about how the Leaderboard A entries scored so high? The top ten all reached 0.999+ accuracy.
1. Why can Leaderboard A scores run so high?
As the figure above shows, the top ten on Leaderboard A have reached accuracies around 0.999, which is jaw-dropping and also puzzling: how do the top players get such high scores?
First, a strong model and well-tuned hyperparameters are a must. Beyond that, it is worth counting how many samples differ between train.csv and test.csv.
# Pick out the samples in test that do not appear in train
import pandas as pd

# Read train and test
train = pd.read_csv('../data_preprocess/train.csv', header=0, delimiter=',',
                    names=['uuid','title','author','abstract','Keywords','label'])
test = pd.read_csv('../data_preprocess/test.csv', header=0, delimiter=',',
                   names=['uuid','title','author','abstract','Keywords'])

# Keep the test samples whose title never appears in train;
# each sample carries 'title', 'author', 'abstract' and 'Keywords'
test_diff = {'uuid':[], 'title':[], 'author':[], 'abstract':[], 'Keywords':[]}
for i in range(len(test['title'])):
    if test['title'][i] not in train['title'].values:
        test_diff['uuid'].append(test['uuid'][i])
        test_diff['title'].append(test['title'][i])
        test_diff['author'].append(test['author'][i])
        test_diff['abstract'].append(test['abstract'][i])
        test_diff['Keywords'].append(test['Keywords'][i])
test_diff = pd.DataFrame(test_diff)
test_diff.to_csv('../data_preprocess/test_diff.csv', index=False)
With the code above, a quick print of len(test_diff['uuid']) shows that 696 samples in test.csv do not appear in train.csv. In other words, 1662 samples are shared between train.csv and test.csv (2358 - 696 = 1662)! Given that the training set holds 6000 samples in total, and assuming the distribution of those 696 samples does not drift far from train.csv, plus the fact that the competition has been open since June 9th, scores of 0.999+ on Leaderboard A are no surprise.
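An equivalent, more idiomatic way to count the overlap is pandas' `isin`. A minimal sketch on toy stand-in frames (the real code would read train.csv and test.csv as above):

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv, keyed by title as in the loop above
train = pd.DataFrame({'title': ['a', 'b', 'c', 'd']})
test = pd.DataFrame({'title': ['b', 'c', 'e']})

# Rows of test whose title never appears in train
test_diff = test[~test['title'].isin(train['title'])]
n_shared = len(test) - len(test_diff)
print(len(test_diff), n_shared)  # 1 unseen title, 2 shared titles
```

On the real files this reproduces the 696 / 1662 split in one vectorized step instead of a Python loop.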
2. At what N is the n-gram the best value for money?
When we do keyword extraction without fine-tuning a large language model, we inevitably deal with n-grams, i.e. choosing the range of keyword lengths.
A quick intuition for n-grams:
In the sentence "To good health, every day to exercise.", the two adjacent words "good health" form a bi-gram (N = 2), while the four adjacent words "every day to exercise" form an n-gram with N = 4.
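In code, the n-grams of a sentence are just every run of N adjacent words. A small illustrative helper (not part of the competition code):

```python
def ngrams(text, n):
    """Return every run of n adjacent words in the text."""
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "To good health, every day to exercise."
print(ngrams(sentence, 2))  # all bi-grams of the example sentence
```

The number of candidates grows with every extra value of N considered, which is why the choice of range matters below.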
So when we use TF-IDF, TextRank, KeyBERT or YAKE to extract keywords from a paper's title and abstract, we must set the range of N for the n-grams. If the range is too small, the score may suffer; if it is too large, the algorithm may run until the end of time. (KeyBERT, I am not talking about you.) (Is this why Baseline2 first uses TF-IDF to pick a pool of candidate words and only then scores them by embedding cosine similarity?)
So, what value of N gives the best value for money?
Here is one approach: first find the maximum keyword length (in words) in the training and test sets, then count the share of each keyword length in both sets, and decide from there.
# Find the maximum keyword length (in words)
import pandas as pd

train = pd.read_csv('../data_preprocess/train.csv', header=0, delimiter=',',
                    names=['uuid','title','author','abstract','Keywords','label'])
test = pd.read_csv('../data_preprocess/test_diff.csv', header=0, delimiter=',',
                   names=['uuid','title','author','abstract','Keywords'])

max_num = 0
for i in range(len(train['Keywords'])):
    keywords = train['Keywords'][i].split(';')
    for keyword in keywords:
        keyword_linear_word = keyword.split(' ')
        if len(keyword_linear_word) > max_num:
            max_num = len(keyword_linear_word)
for i in range(len(test['Keywords'])):
    keywords = test['Keywords'][i].split(';')  # note: the test set, not train
    for keyword in keywords:
        keyword_linear_word = keyword.split(' ')
        if len(keyword_linear_word) > max_num:
            max_num = len(keyword_linear_word)
print(f'Maximum keyword length in words: {max_num}')
# Output: maximum keyword length is 19
Running the code above shows that the longest keyword spans 19 words.
Next, let's count the share of each keyword length in the training and test sets:
# Count the share of keywords of each length from 1 to 19 words
import pandas as pd

train = pd.read_csv('../data_preprocess/train.csv', header=0, delimiter=',',
                    names=['uuid','title','author','abstract','Keywords','label'])
test = pd.read_csv('../data_preprocess/test_diff.csv', header=0, delimiter=',',
                   names=['uuid','title','author','abstract','Keywords'])

# One counter per keyword length (indices 1 to 19; index 0 stays unused)
keywords_len_list = [0] * 20
for i in range(len(train['Keywords'])):
    keywords = train['Keywords'][i].split(';')
    for keyword in keywords:
        keyword_linear_word = keyword.split(' ')
        keywords_len_list[len(keyword_linear_word)] += 1
for i in range(len(test['Keywords'])):
    keywords = test['Keywords'][i].split(';')  # note: the test set, not train
    for keyword in keywords:
        keyword_linear_word = keyword.split(' ')
        keywords_len_list[len(keyword_linear_word)] += 1
keywords_len_odds = [keywords_len_list[i] / sum(keywords_len_list) for i in range(len(keywords_len_list))]
for i in range(len(keywords_len_odds)):
    print('Share of keywords of length %d: %f' % (i, keywords_len_odds[i]))
Program output:
Looking at the output, taking N in (2, 3, 4) is the best value for money: the algorithm does not run for long, and the extracted keywords are roughly sufficient.
3. How high can we score by extracting keywords only for texts we have already seen?
# Count the sample totals in each dataset and the number of identical samples
import pandas as pd

def statics_repeat_names(c_name):
    train = pd.read_csv('../data_preprocess/train.csv', header=0, delimiter=',', names=['uuid','title','author','abstract','Keywords','label'])
    test = pd.read_csv('../data_preprocess/test.csv', header=0, delimiter=',', names=['uuid','title','author','abstract','Keywords'])
    testB = pd.read_csv('../data_preprocess/testB.csv', header=0, delimiter=',', names=['uuid','title','author','abstract'])
    # Collect the column values from train.csv
    train_values = []
    for i in range(len(train[c_name])):
        train_values.append(train[c_name][i])
    # Collect the column values from test.csv
    test_values = []
    for i in range(len(test[c_name])):
        test_values.append(test[c_name][i])
    if c_name != 'Keywords':
        # Collect the column values from testB.csv
        testB_values = []
        for i in range(len(testB[c_name])):
            testB_values.append(testB[c_name][i])
    else:
        # testB.csv has no Keywords column
        testB_values = []
    # Total number of samples per file
    total_train = len(train_values)
    total_test = len(test_values)
    if c_name != 'Keywords':
        total_testB = len(testB_values)
    else:
        total_testB = 2000  # known size of testB.csv
    # Count values shared between train.csv and test.csv
    same_values = 0
    for word in train_values:
        if word in test_values:
            same_values += 1
    # Count values shared between train.csv and testB.csv
    same_values_B = 0
    for word in train_values:
        if word in testB_values:
            same_values_B += 1
    print(f'Total samples in train.csv: {total_train}')
    print(f'Total samples in test.csv: {total_test}')
    print(f'Total samples in testB.csv: {total_testB}')
    print(f'Identical {c_name} between train.csv and test.csv: {same_values}')
    print(f'Identical {c_name} between train.csv and testB.csv: {same_values_B}\n')

if __name__ == '__main__':
    statics_repeat_names('title')
    statics_repeat_names('author')
    statics_repeat_names('abstract')
    statics_repeat_names('Keywords')
II. My Approach
I tackled the competition's two tasks separately, divide and conquer, rather than solving both at once with Baseline2 or a large language model.
My final Leaderboard B score was 0.564, using: model ensembling for the medical-literature classification task, plus LoRA fine-tuning of ChatGLM2 for the keyword-extraction task.
1. Medical literature classification
(Leaderboard A notes: Datawhale AI Summer Camp - NLP Practice: Text Classification and Keyword Extraction from Paper Abstracts Challenge - Notes on My First Run at Leaderboard A)
During Leaderboard A my main method was model ensembling, but I made one mistake: I never evaluated the models locally before choosing the best one for inference and submission. After ensembling 7 models, the first attempt worked so well that I kept "piling on", spending a lot of time growing the ensemble to 15 models, yet it performed worse than the first 7-model ensemble. When I finally split the dataset 5:5 locally to train and evaluate, I found the 8 later additions were completely unnecessary (heartbreaking). The best performer was the ensemble of four models: bert-large-uncased_embedding+LSTM, deberta-v3-base_embedding+LSTM, deberta-v3-large_embedding+LSTM and roberta-base_embedding+LSTM. (To keep training cheap, I did not fine-tune the pretrained models with a classification head. Instead I converted each sample's text into the pretrained model's 768-dimensional [CLS] embedding and fed that into an LSTM plus classifier; on an Nvidia 2080 each model trains in roughly 1-2 minutes.)
Below is the code I used to train and evaluate the ensemble.
1. Split the training set into train : validation = 5 : 5
# Split the training set labels into train : validation = 5 : 5
import pandas as pd

df = pd.read_csv('../data_preprocess/train.csv', delimiter=',', header=0, names=['uuid', 'title', 'author', 'abstract', 'Keywords', 'label'])
df = df['label']
df = df.to_numpy()
# Split in half
df1 = df[:int(len(df)/2)]
df2 = df[int(len(df)/2):]
# Save
df1 = pd.DataFrame(df1)
df1.to_csv('./train_0.5.csv', index=False, header=False)
df2 = pd.DataFrame(df2)
df2.to_csv('./val_0.5.csv', index=False, header=False)
2. Convert each sample's text into the pretrained model's 768-dimensional [CLS] embedding
# Convert the train/test csv files into npy embedding files
import pandas as pd
import numpy as np
from text2vec import SentenceModel
import torch
import tqdm
import warnings
warnings.filterwarnings('ignore')

# In the test set, the sample with uuid 536 is missing its abstract -> filled in by hand

# Data preprocessing
def process(name, model_name, model_index, include_keywords=True):
    df = pd.read_csv(name, delimiter=',', header=0, names=['uuid', 'title', 'author', 'abstract', 'Keywords', 'label'])
    df = df.dropna()  # drop empty rows
    df = df.reset_index(drop=True)  # re-index so positional access below still works after dropna
    # Merge title, author, abstract and Keywords into one content field
    if include_keywords:
        df['content'] = df['title'] + df['author'] + df['abstract'] + df['Keywords']
    else:
        df['content'] = df['title'] + df['author'] + df['abstract']
    df = {'uuid':df['uuid'], 'content':df['content'], 'label':df['label']}
    # Load the pretrained sentence encoder
    if 'longformer' in model_name:
        vectors = SentenceModel(model_name_or_path=f"../code/premodel/{model_name}", max_seq_length=648, device='cuda' if torch.cuda.is_available() else 'cpu')
    else:
        vectors = SentenceModel(model_name_or_path=f"../code/premodel/{model_name}", max_seq_length=512, device='cuda' if torch.cuda.is_available() else 'cpu')
    print('Converting text to vectors...')
    for i in tqdm.trange(len(df['content'])):
        sentence = df['content'][i]
        # Encode the sample's text
        sentence_vectors = vectors.encode(sentence)
        df['content'][i] = sentence_vectors
    # Convert everything to npy arrays: 1. content vectors 2. labels (train)
    content_vectors = np.array(df['content'])
    label_vectors = np.array(df['label'])
    # Split 5:5 into training and validation halves
    content_vectors1 = np.array_split(content_vectors, 2)[0]
    label_vectors1 = np.array_split(label_vectors, 2)[0]
    content_vectors2 = np.array_split(content_vectors, 2)[1]
    label_vectors2 = np.array_split(label_vectors, 2)[1]
    np.save(f'./embedding/x_train_0.5_{model_index}.npy', content_vectors1)
    np.save(f'./embedding/y_train_0.5_{model_index}.npy', label_vectors1)
    np.save(f'./embedding/x_val_0.5_{model_index}.npy', content_vectors2)
    np.save(f'./embedding/y_val_0.5_{model_index}.npy', label_vectors2)
    print(f'{model_index}.{model_name} done.')

if __name__ == '__main__':
    # model = ['bert-base-uncased', 'bert-large-uncased',
    #          'deberta-v3-base', 'deberta-v3-large',
    #          'roberta-base', 'roberta-large',
    #          'longformer-base-4096']
    # for model_index, model_name in enumerate(model):
    #     process('../data_preprocess/train.csv', model_name, model_index+1, include_keywords=False)
    # Added separately: 8 albert-xxlarge-v2
    # process('../data_preprocess/train.csv', 'albert-xxlarge-v2', 8, include_keywords=False)
    # Added separately: 9 potsawee/longformer-large-4096-answering-race
    # process('../data_preprocess/train.csv', 'longformer-large-4096-answering-race', 9, include_keywords=False)
    # 10 multilingual - e5-large
    # process('../data_preprocess/train.csv', 'multilingual-e5-large', 10, include_keywords=False)
    # 11 multilingual - base
    # process('../data_preprocess/train.csv', 'paraphrase-multilingual-mpnet-base-v2', 11, include_keywords=False)
    # 12 medical roberta
    # process('../data_preprocess/train.csv', 'nuclear_medicine_DARoBERTa', 12, include_keywords=False)
    # 13 multilingual - e5-base
    # process('../data_preprocess/train.csv', 'multilingual-e5-base', 13, include_keywords=False)
    # 14 multilingual - e5-small
    # process('../data_preprocess/train.csv', 'multilingual-e5-small', 14, include_keywords=False)
    # 15 English - e5-large-v2
    process('../data_preprocess/train.csv', 'e5-large-v2', 15, include_keywords=False)
3. Train each model
(The training code is part of the combined training-and-evaluation script shown in step 4 below.)
4. Combine the models in different subsets and evaluate
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn import metrics
import matplotlib.pyplot as plt
import os
import time
import warnings
warnings.filterwarnings('ignore')

# Hyperparameter settings
class opt:
    seed = 100
    batch_size = 64
    set_epoch = 300
    early_stop = 15           # about the training process
    consideration_num = 5     # about model saving
    learning_rate = 1e-3
    weight_decay = 2e-6       # L2 regularization
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    gpu_num = 1
    use_data_au = False
    use_BCE = False
    models = ['bert-base-uncased', 'bert-large-uncased',
              'deberta-v3-base', 'deberta-v3-large',
              'roberta-base', 'roberta-large',
              'longformer-base-4096', 'albert-xxlarge-v2', 'longformer-large-4096-answering-race',
              'multilingual-e5-large', 'paraphrase-multilingual-mpnet-base-v2', 'nuclear_medicine_DARoBERTa',
              'multilingual-e5-base', 'multilingual-e5-small', 'e5-large-v2']
# Model definition
class LSTM(nn.Module):
    def __init__(self):
        super(LSTM, self).__init__()
        # One LSTM per possible embedding width; forward() uses whichever
        # matches the input's feature dimension
        hidden_size = 256
        self.lstm1 = nn.LSTM(input_size=768, hidden_size=hidden_size, bidirectional=False)
        self.lstm2 = nn.LSTM(input_size=1024, hidden_size=hidden_size, bidirectional=False)
        self.lstm3 = nn.LSTM(input_size=4096, hidden_size=hidden_size, bidirectional=False)
        self.lstm4 = nn.LSTM(input_size=384, hidden_size=hidden_size, bidirectional=False)
        self.act = nn.Tanh()
        self.classifier1 = nn.Linear(hidden_size, 2)  # for CrossEntropyLoss
        self.classifier2 = nn.Linear(hidden_size, 1)  # for BCEWithLogitsLoss

    def forward(self, x):
        try:
            x, _ = self.lstm1(x)
        except:
            try:
                x, _ = self.lstm2(x)
            except:
                try:
                    x, _ = self.lstm3(x)
                except:
                    x, _ = self.lstm4(x)
        e = x
        x = self.act(x)
        if not opt.use_BCE:
            x = self.classifier1(x)
        else:
            x = self.classifier2(x)
        return x, e
# 1. Data loading
def load_data(model_index):
    train_data_path = f'./embedding/x_train_0.5_{model_index}.npy'
    train_label_path = f'./embedding/y_train_0.5_{model_index}.npy'
    val_data_path = f'./embedding/x_val_0.5_{model_index}.npy'
    val_label_path = f'./embedding/y_val_0.5_{model_index}.npy'
    train_data = torch.tensor(np.load(train_data_path, allow_pickle=True).tolist()).float()
    train_label = torch.tensor(np.load(train_label_path, allow_pickle=True).tolist()).long()
    val_data = torch.tensor(np.load(val_data_path, allow_pickle=True).tolist()).float()
    val_label = torch.tensor(np.load(val_label_path, allow_pickle=True).tolist()).long()
    train_dataset = torch.utils.data.TensorDataset(train_data, train_label)
    val_dataset = torch.utils.data.TensorDataset(val_data, val_label)
    return train_dataset, val_dataset
# 2. Model training
def model_pretrain(model_index, train_loader, val_loader):
    # Hyperparameters
    set_epoch = opt.set_epoch
    early_stop = opt.early_stop
    consideration_num = opt.consideration_num
    learning_rate = opt.learning_rate
    weight_decay = opt.weight_decay
    device = opt.device
    gpu_num = opt.gpu_num
    # If a best model already exists, load and return it instead of retraining
    if os.path.exists(f'checkpoints/best_model{model_index}.pth'):
        try:
            best_model = LSTM()
            best_model.load_state_dict(torch.load(f'checkpoints/best_model{model_index}.pth'))
            return best_model
        except:
            pass
    model_save_dir = 'checkpoints'
    models = opt.models
    model_name = models[model_index-1]
    result_save_dir = 'results'
    # Model loading
    model = LSTM().to(device)
    # Optimizer loading
    if device != 'cpu' and gpu_num > 1:
        optimizer = torch.optim.NAdam(model.module.parameters(), lr=learning_rate, weight_decay=weight_decay)
        optimizer = torch.nn.DataParallel(optimizer, device_ids=list(range(gpu_num)))
    else:
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
    # Loss function loading
    if opt.use_BCE:
        loss_func = nn.BCEWithLogitsLoss()
    else:
        loss_func = nn.CrossEntropyLoss()
    # Model training
    # - This training code comes from my GitHub project Torcheasy: https://github.com/Zeaulo/Torcheasy
    best_epoch = 0
    best_train_loss = 100000
    train_acc_list = []
    train_loss_list = []
    fixed_val_acc = None
    best_model_loss = 100000
    models_list = [None] * consideration_num  # ring buffer of recent model snapshots
    start_time = time.time()
    for epoch in range(set_epoch):
        model.train()
        train_loss = 0
        train_acc = 0
        for i, (x, y) in enumerate(train_loader):
            x = x.to(device)
            y = y.to(device)
            outputs, embedding = model(x)
            if opt.use_BCE:
                loss = loss_func(outputs, y.float().unsqueeze(1))
            else:
                loss = loss_func(outputs, y)
            train_loss += loss.item()
            optimizer.zero_grad()
            loss.backward()
            if device != 'cpu' and gpu_num > 1:
                optimizer.module.step()
            else:
                optimizer.step()
            if not opt.use_BCE:
                _, predicted = torch.max(outputs.data, 1)
            else:
                predicted = (outputs > 0.5).int()
                predicted = predicted.squeeze(1)
            train_acc += (predicted == y).sum().item()
        average_mode = 'binary'
        # f1 / precision / recall are computed on the last batch only
        train_f1 = metrics.f1_score(y.cpu(), predicted.cpu(), average=average_mode)
        train_pre = metrics.precision_score(y.cpu(), predicted.cpu(), average=average_mode)
        train_recall = metrics.recall_score(y.cpu(), predicted.cpu(), average=average_mode)
        train_loss /= len(train_loader)
        train_acc /= len(train_loader.dataset)
        train_acc_list.append(train_acc)
        train_loss_list.append(train_loss)
        print('-'*50)
        print('Epoch [{}/{}]\n Train Loss: {:.4f}, Train Acc: {:.4f}'.format(epoch + 1, set_epoch, train_loss, train_acc))
        print('Train-f1: {:.4f}, Train-precision: {:.4f} Train-recall: {:.4f}'.format(train_f1, train_pre, train_recall))
        # evaluate on the validation set
        model.eval()
        with torch.no_grad():
            val_acc = 0
            val_loss = 0
            for i, (x, y) in enumerate(val_loader):
                x = x.to(device)
                y = y.to(device)
                outputs, embedding = model(x)
                if opt.use_BCE:
                    loss = loss_func(outputs, y.float().unsqueeze(1))
                else:
                    loss = loss_func(outputs, y)
                val_loss += loss.item()
                if not opt.use_BCE:
                    _, predicted = torch.max(outputs.data, 1)
                else:
                    predicted = (outputs > 0.5).int()
                    predicted = predicted.squeeze(1)
                val_acc += (predicted == y).sum().item()
            val_loss /= len(val_loader)
            val_acc /= len(val_loader.dataset)
            print('Val Loss: {:.4f}, Val Acc: {:.4f}'.format(val_loss, val_acc))
        # Choose the best model for saving
        # (epoch+1) % consideration_num -> replace the oldest snapshot in the ring buffer
        models_list[(epoch+1) % consideration_num] = {k: v.detach().clone() for k, v in model.state_dict().items()}
        if epoch+1 >= consideration_num:
            # The last consideration_num training losses form the current window
            models_list_loss = train_loss_list[-consideration_num:]
            # model_loss is the average loss over the window;
            # the aim is to find the most stable stretch of training
            model_loss = np.mean(models_list_loss)
            if model_loss <= best_model_loss:
                best_model_loss = model_loss
                # Pick the best snapshot inside the most stable window
                perfect = np.argmin(models_list_loss)
                print(f'-> best_model_loss {model_loss} has accessed, saving model...')
                fixed_val_acc = val_acc
                # Map the window position back to its slot in the ring buffer
                best_state = models_list[(epoch + 2 + perfect) % consideration_num]
                torch.save(best_state, f'{model_save_dir}/best_model{model_index}.pth')
        if train_loss < best_train_loss:
            best_train_loss = train_loss
            best_epoch = epoch + 1
        # Early stopping
        if epoch+1 - best_epoch == early_stop:
            print(f'{early_stop} epochs later the training loss no longer decreases, so training stops early.')
            end_time = time.time()
            print(f'Total time is {end_time - start_time}s.')
            # Record fixed_val_acc in the result file
            with open(f'{result_save_dir}/{model_name}_result.txt', 'a') as f:
                f.write(f'{model_name} {model_index} {fixed_val_acc}\n')
            break
    # Draw the accuracy and loss curves of the training set
    plt.figure()
    plt.plot(range(1, len(train_acc_list) + 1), train_acc_list, label='train_acc')
    plt.legend()
    plt.savefig(f'{result_save_dir}/{model_name}_acc.png')
    plt.close()
    plt.figure()
    plt.plot(range(1, len(train_loss_list) + 1), train_loss_list, label='train_loss')
    plt.legend()
    plt.savefig(f'{result_save_dir}/{model_name}_loss.png')
    plt.close()
    best_model = LSTM()
    best_model.load_state_dict(torch.load(f'checkpoints/best_model{model_index}.pth'))
    return best_model
# 3. Model inference
def model_predict(model, model_index, train_loader, val_loader):
    model.to('cpu')
    model.eval()
    y_train_labels = []
    train_embedding = []
    val_embedding = []
    val_outputs = None
    for i, (data, label) in enumerate(train_loader):
        data = data.to('cpu')
        _, embedding = model(data)
        if i == 0:
            train_embedding = embedding
            y_train_labels = label
        else:
            train_embedding = torch.cat([train_embedding, embedding], dim=0)
            y_train_labels = torch.cat([y_train_labels, label], dim=0)
    train_embedding = train_embedding.detach().numpy()
    for i, data in enumerate(val_loader):
        data = data[0].to('cpu')
        outputs, embedding = model(data)
        if i == 0:
            val_embedding = embedding
            val_outputs = outputs
        else:
            val_embedding = torch.cat([val_embedding, embedding], dim=0)
            val_outputs = torch.cat([val_outputs, outputs], dim=0)
    val_embedding = val_embedding.detach().numpy()
    # Save val_outputs, the predicted probabilities
    if not opt.use_BCE:
        val_outputs = torch.softmax(val_outputs, dim=1)
    torch.save(val_outputs, f'./pred_prob/{model_index}_prob.pth')
def run(model_index):
    # Random seed setup
    seed = opt.seed
    torch.seed = seed
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    train_dataset, val_dataset = load_data(model_index)
    # Print dataset information
    print('-Dataset info:')
    print(f'-Training samples: {len(train_dataset)}, validation samples: {len(val_dataset)}')
    train_labels = len(set(train_dataset.tensors[1].numpy()))
    # Check whether the classes are balanced
    print(f'-Number of label classes in the training set: {train_labels}')
    numbers = [0] * train_labels
    for i in train_dataset.tensors[1].numpy():
        numbers[i] += 1
    print(f'-Samples per class in the training set:')
    for i in range(train_labels):
        print(f'-Class {i}: {numbers[i]} samples')
    batch_size = opt.batch_size
    # Data loaders
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = torch.utils.data.DataLoader(dataset=val_dataset, batch_size=batch_size, shuffle=False)
    val_Predict_loader = torch.utils.data.DataLoader(dataset=val_dataset, batch_size=1, shuffle=False)
    best_model = model_pretrain(model_index, train_loader, val_loader)
    # Evaluate the model on the validation set
    model_predict(best_model, model_index, train_loader, val_Predict_loader)

if __name__ == '__main__':
    # for i in range(1, 8):
    #     run(i)
    # Train individually
    # run(8)
    # run(9)
    # run(10)
    # run(11)
    # run(13)
    # run(14)
    run(15)
Final output (top 45 combinations shown):
My first Leaderboard A submission used the ensemble of models 1-7, which sits at rank 45 in the figure above, while the best-performing combination turned out to be (2, 3, 4, 5): the ensemble of bert-large-uncased_embedding+LSTM, deberta-v3-base_embedding+LSTM, deberta-v3-large_embedding+LSTM and roberta-base_embedding+LSTM.
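The combination step itself is simple once each model's validation probabilities are saved: average the chosen models' probability matrices and take the argmax. A minimal numpy sketch of this soft-voting step on toy data; in practice each matrix would come from the `./pred_prob/{model_index}_prob.pth` tensors saved by model_predict, converted with .numpy():

```python
import numpy as np

def ensemble_accuracy(prob_list, labels):
    """Soft-voting ensemble: average the (n_samples, 2) probability
    matrices of the chosen models, then compare argmax to the labels."""
    avg = np.mean(prob_list, axis=0)
    preds = avg.argmax(axis=1)
    return (preds == labels).mean()

# Toy example: two "models", three validation samples
labels = np.array([1, 0, 1])
m1 = np.array([[0.4, 0.6], [0.7, 0.3], [0.5, 0.5]])
m2 = np.array([[0.3, 0.7], [0.6, 0.4], [0.2, 0.8]])
print(ensemble_accuracy([m1, m2], labels))  # 1.0 on this toy data
```

Looping `itertools.combinations` over the model indices and scoring each subset this way reproduces the ranking table above.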
2. Keyword extraction
I had never done keyword extraction before, and after Leaderboard B opened I still spent a day (the 24th) tinkering with the classification ensemble to lock in a 0.369+ score on that task, so very little time was left for keyword extraction. As the title says, I really only had two days on Leaderboard B. On the second day (the 25th) I started on keyword extraction, trying TF-IDF, TextRank and YAKE; YAKE performed best, reaching 0.43459.
Code:
import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
import yake

# Read testB.csv
testB = pd.read_csv('../data_preprocess/testB.csv', header=0, delimiter=',',
                    names=['uuid','title','author','abstract'])
# Read the submission file
submit = pd.read_csv('../submit/Ensemble_LSTM_submitB.csv', header=0, delimiter=',',
                     names=['uuid','Keywords','label'])
# Load the extractor
kw_extractor = yake.KeywordExtractor(n=4, top=40, dedupLim=1)
# Merge title and abstract
testB['title_abstract'] = testB['title'] + testB['abstract']
# Extract keywords from each text
all_lines_keywords = []
for text in tqdm(testB['title_abstract']):
    keywords = kw_extractor.extract_keywords(text)
    keywords = [keyword[0] for keyword in keywords]
    keywords = '; '.join(keywords)
    all_lines_keywords.append(keywords)
# Fill the extracted keywords into the submission
for i in range(len(submit['Keywords'])):
    submit['Keywords'][i] = all_lines_keywords[i]
# Save
submit.to_csv('Ensemble_LSTM_yake.csv', index=False)
Submission result:
At that point I was in fourth place, but I had used up my submission quota.
Next I planned to push the score further with a text-generation approach.
After renting an NVIDIA 3090, I spent an afternoon plus an evening setting up the environment and getting the code working.
But once I accounted for the LoRA fine-tuning time and the time to run inference on 2000 samples, the official competition had to end there for me.
After the official competition closed, I tried ChatGLM2 (keywords for the first 1200 samples) + YAKE (keywords for the remaining 800 samples), which scored 0.51063.
Using ChatGLM2 to extract keywords for all samples scored 0.564.
III. What I Gained from the Competition
1. My first summer camp in this format; I met a lot of talented people.
2. My first time finishing a competition in just seven days (7/20-7/26); I am happy with what I achieved in so little time.
3. My first time implementing the model ensemble I wanted in code, with satisfying results.
4. My first keyword-extraction task, learning a range of methods:
①TF-IDF + embeddings
②TextRank
③YAKE
5. My first time using large-language-model fine-tuning in a competition.
6. More score-boosting tricks under my belt:
①k-fold cross-validation
②model ensembling
7. More NLP competition experience:
①Prepare for the competition early and be able to validate model performance locally.
②When you are unsure whether the test and training distributions drift, split train : validation = 5 : 5; a large enough validation set gives a more reliable model evaluation.
③When large-model fine-tuning is involved, budget extra time.
-> The content above is adapted from my notes for the Datawhale summer camp: Datawhale AI Summer Camp - NLP Practice: Text Classification and Keyword Extraction from Paper Abstracts Challenge - Two Days of Leaderboard B at the Limit (Feishu docs, feishu.cn)
With that,
thank you for reading to the end.