Datawhale AI Summer Camp - NLP Practice: Text Classification and Keyword Extraction Challenge Based on Paper Abstracts — A Two-Day Sprint on the B Leaderboard

Author: 净好


Read this first:

Link to my A-leaderboard notes: Datawhale AI Summer Camp - NLP Practice: Text Classification and Keyword Extraction Challenge Based on Paper Abstracts — Notes on First Tackling the A Leaderboard


Contents

I. Data statistics

1. Why were the A-leaderboard scores so high?

2. Which N gives the best cost-performance for N-grams?

3. How high a score can keyword extraction on known text alone reach?

II. My approach

1. Medical-literature classification

2. Keyword extraction

III. Takeaways from the competition


I. Data statistics

Before taking on the B leaderboard, were you as curious as I was about why the A-leaderboard scores ran so high? The top ten all reached 0.999+ accuracy.

1. Why were the A-leaderboard scores so high?

As the leaderboard screenshot above shows, the top-ten accuracies on the A leaderboard reached 0.999 — jaw-dropping, and puzzling: how did those competitors score so high?

Strong models and well-judged hyperparameter tuning are certainly necessary, but beyond that, let's first count how many samples differ between train.csv and test.csv.

# Pick out the samples that differ between train and test
import pandas as pd
# Read train and test
train = pd.read_csv('../data_preprocess/train.csv', header=0, delimiter=',', 
                names=['uuid','title','author','abstract','Keywords','label'])
test  = pd.read_csv('../data_preprocess/test.csv', header=0, delimiter=',', 
                names=['uuid','title','author','abstract','Keywords'])

# Keep test samples whose title does not appear in train;
# each sample carries 'title', 'author', 'abstract', 'Keywords'
test_diff = {'uuid':[], 'title':[], 'author':[], 'abstract':[], 'Keywords':[]}
for i in range(len(test['title'])):
    if test['title'][i] not in train['title'].values:
        test_diff['uuid'].append(test['uuid'][i])
        test_diff['title'].append(test['title'][i])
        test_diff['author'].append(test['author'][i])
        test_diff['abstract'].append(test['abstract'][i])
        test_diff['Keywords'].append(test['Keywords'][i])

test_diff = pd.DataFrame(test_diff)
test_diff.to_csv('../data_preprocess/test_diff.csv', index=False)

With the code above, a quick print of len(test_diff['uuid']) shows that 696 samples in test.csv do not appear in train.csv. In other words, 1662 samples are shared between train.csv and test.csv (2358 - 696 = 1662)! Given that the training set holds 6000 samples, assuming the distribution of those 696 samples does not drift far from train.csv's, and that the competition had been open since June 9, it is no surprise that A-leaderboard scores reached 0.999+.
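The same overlap count can be computed more concisely with pandas' `isin` — a minimal sketch on toy data (on the real files, `train` and `test` would be read with `pd.read_csv` as above):

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv; overlap is judged by 'title'
train = pd.DataFrame({'title': ['a', 'b', 'c', 'd']})
test  = pd.DataFrame({'title': ['b', 'c', 'e']})

# Boolean mask: True where a test title never appears in train
diff_mask = ~test['title'].isin(train['title'])
test_diff = test[diff_mask].reset_index(drop=True)

n_diff = len(test_diff)       # samples unique to test
n_same = len(test) - n_diff   # samples shared with train
```

On the real data this reproduces the counts above: 696 unique and 2358 - 696 = 1662 shared.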

2. Which N gives the best cost-performance for N-grams?

For keyword extraction, unless we fine-tune a large language model, we inevitably deal with N-grams — that is, with choosing the range of keyword lengths.

A quick intuition for N-grams:

In the sentence "To good health, every day to exercise.", the two adjacent words "good health" form a bi-gram (N = 2), while the four adjacent words "every day to exercise" form a 4-gram (N = 4).
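The same idea in code — a tiny helper (my own illustration, not from the competition code) that enumerates all contiguous n-word sequences of a whitespace-tokenized sentence:

```python
def ngrams(tokens, n):
    """All contiguous n-word sequences in a token list."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'to good health every day to exercise'.split()
bigrams = ngrams(tokens, 2)      # includes 'good health'
fourgrams = ngrams(tokens, 4)    # includes 'every day to exercise'
```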

So when we use TF-IDF, TextRank, KeyBERT, or YAKE to extract keywords from a paper's title and abstract, we must set an N-gram range. Too small a range and the score may suffer; too large a range and the algorithm may run until the end of time (KeyBERT, I'm not naming names). (Is that why Baseline2 first uses TF-IDF to pick a pool of candidate words and only then computes embedding cosine similarities?)

So which N-gram range gives the best cost-performance?

Here is one approach: first find the maximum keyword length (in words) across the training and test sets, then count the proportion of keywords at each length, and decide from there.

# Find the maximum keyword length (in words)
import pandas as pd

train = pd.read_csv('../data_preprocess/train.csv', header=0, delimiter=',',
         names=['uuid','title','author','abstract','Keywords','label'])
test  = pd.read_csv('../data_preprocess/test_diff.csv', header=0, delimiter=',',
         names=['uuid','title','author','abstract','Keywords'])

max_num = 0
for i in range(len(train['Keywords'])):
    keywords = train['Keywords'][i].split(';')
    for keyword in keywords:
        keyword_linear_word = keyword.split(' ')
        if len(keyword_linear_word) > max_num:
            max_num = len(keyword_linear_word)
for i in range(len(test['Keywords'])):
    keywords = test['Keywords'][i].split(';')
    for keyword in keywords:
        keyword_linear_word = keyword.split(' ')
        if len(keyword_linear_word) > max_num:
            max_num = len(keyword_linear_word)

print(f'Maximum keyword length (in words): {max_num}')
# Maximum keyword length: 19

Running the code above shows the maximum keyword length is 19 words.

Next, let's count the proportion of keywords at each length across the training and test sets:

# Count the share of keywords composed of 1 to 19 tokens
import pandas as pd
train = pd.read_csv('../data_preprocess/train.csv', header=0, delimiter=',',
             names=['uuid','title','author','abstract','Keywords','label'])
test  = pd.read_csv('../data_preprocess/test_diff.csv', header=0, delimiter=',',
             names=['uuid','title','author','abstract','Keywords'])

# Counter for keywords of each length (index 0 stays unused; lengths run from 1 to 19)
keywords_len_list = [0] * 20
for i in range(len(train['Keywords'])):
    keywords = train['Keywords'][i].split(';')
    for keyword in keywords:
        keyword_linear_word = keyword.split(' ')
        keywords_len_list[len(keyword_linear_word)] += 1
for i in range(len(test['Keywords'])):
    keywords = test['Keywords'][i].split(';')
    for keyword in keywords:
        keyword_linear_word = keyword.split(' ')
        keywords_len_list[len(keyword_linear_word)] += 1

keywords_len_odds = [keywords_len_list[i] / sum(keywords_len_list) for i in range(len(keywords_len_list))]
for i in range(1, len(keywords_len_odds)):
    print('Share of keywords of length %d: %f' % (i, keywords_len_odds[i]))

Program output:

Looking at the results, taking N in (2, 3, 4) offers the best cost-performance: the algorithm does not run too long, and the extracted keywords are roughly sufficient.

3. How high a score can keyword extraction on known text alone reach?

# Count each dataset's total samples and the number of identical values
import pandas as pd
def statics_repeat_names(c_name):
    train = pd.read_csv('../data_preprocess/train.csv', header=0, delimiter=',', names=['uuid','title','author','abstract','Keywords','label'])
    test  = pd.read_csv('../data_preprocess/test.csv', header=0, delimiter=',', names=['uuid','title','author','abstract','Keywords'])
    testB = pd.read_csv('../data_preprocess/testB.csv', header=0, delimiter=',', names=['uuid','title','author','abstract'])

    # Collect the chosen column from train.csv
    train_keywords = []
    for i in range(len(train[c_name])):
        train_keywords.append(train[c_name][i])

    # Collect the chosen column from test.csv
    test_keywords = []
    for i in range(len(test[c_name])):
        test_keywords.append(test[c_name][i])

    if c_name != 'Keywords':
        # Collect the chosen column from testB.csv
        testB_keywords = []
        for i in range(len(testB[c_name])):
            testB_keywords.append(testB[c_name][i])
    else:
        # testB.csv has no Keywords column
        testB_keywords = []

    # Total count in train.csv
    total_train_keywords = len(train_keywords)

    # Total count in test.csv
    total_test_keywords = len(test_keywords)

    # Total count in testB.csv (2000 samples; it lacks a Keywords column)
    if c_name != 'Keywords':
        total_testB_keywords = len(testB_keywords)
    else:
        total_testB_keywords = 2000

    # Count values shared between train.csv and test.csv
    same_keywords = 0
    for word in train_keywords:
        if word in test_keywords:
            same_keywords += 1

    # Count values shared between train.csv and testB.csv
    same_keywords_B = 0
    for word in train_keywords:
        if word in testB_keywords:
            same_keywords_B += 1

    print(f'Total samples in train.csv: {total_train_keywords}')
    print(f'Total samples in test.csv: {total_test_keywords}')
    print(f'Total samples in testB.csv: {total_testB_keywords}')
    print(f'Number of identical {c_name} values between train.csv and test.csv: {same_keywords}')
    print(f'Number of identical {c_name} values between train.csv and testB.csv: {same_keywords_B}\n')

if __name__ == '__main__':
    statics_repeat_names('title')
    statics_repeat_names('author')
    statics_repeat_names('abstract')
    statics_repeat_names('Keywords')

II. My approach

For the competition's two tasks I took a divide-and-conquer approach, rather than solving both at once with Baseline2 or a large language model.

My final B-leaderboard score was 0.564, using: model ensembling → medical-literature classification, plus LoRA fine-tuning of ChatGLM2 → keyword extraction.

1. Medical-literature classification

Link to my A-leaderboard notes: Datawhale AI Summer Camp - NLP Practice: Text Classification and Keyword Extraction Challenge Based on Paper Abstracts — Notes on First Tackling the A Leaderboard

On the A leaderboard my main method was model ensembling, but I made one mistake: I never evaluated the models locally before choosing the best one for inference and submission. After ensembling 7 models, the first attempt worked so well that I kept piling on, spending a lot of time pushing the ensemble to 15 models — yet it performed worse than the first 7-model ensemble. Finally, splitting the data 5:5 locally for training and evaluation showed the 8 later additions were entirely unnecessary (heartbreaking). The best performer was the ensemble of four models: bert-large-uncased_embedding+LSTM, deberta-v3-base_embedding+LSTM, deberta-v3-large_embedding+LSTM, and roberta-base_embedding+LSTM. (To save training time, instead of fine-tuning each pretrained model with a classification head, I converted each sample's text into the pretrained model's 768-dimensional [CLS] embedding and fed that into an LSTM plus classifier; on an NVIDIA 2080 one model trains in about 1-2 minutes.)

Below is the code I used to train and evaluate the ensembled models.

1. Split the training set into train : validation = 5 : 5

# Split the training set into train : validation = 5 : 5
# (only the labels are saved here; the embeddings themselves are split in the next step)
import pandas as pd
df = pd.read_csv('../data_preprocess/train.csv', delimiter=',', header=0, names=['uuid', 'title', 'author', 'abstract', 'Keywords', 'label'])
df = df['label']
df = df.to_numpy()

# Split in half
df1 = df[:int(len(df)/2)]
df2 = df[int(len(df)/2):]

# Save
df1 = pd.DataFrame(df1)
df1.to_csv('./train_0.5.csv', index=False, header=False)
df2 = pd.DataFrame(df2)
df2.to_csv('./val_0.5.csv', index=False, header=False)

2. Convert each sample's text into the pretrained model's 768-dimensional [CLS] embedding

# Convert the training and test CSV files into npy files

import pandas as pd
import numpy as np
from text2vec import SentenceModel
import torch
import tqdm
import warnings
warnings.filterwarnings('ignore')

# In the test set, the sample with uuid 536 lacks an abstract -> filled in manually

# Data preprocessing
def process(name, model_name, model_index, include_keywords=True):
    df = pd.read_csv(name, delimiter=',', header=0, names=['uuid', 'title', 'author', 'abstract', 'Keywords', 'label'])
    
    df = df.dropna() # drop rows containing missing values
    # Merge title, author, abstract, Keywords into content
    if include_keywords:
        df['content'] = df['title'] + df['author'] + df['abstract'] + df['Keywords']
    else:
        df['content'] = df['title'] + df['author'] + df['abstract']
        
    df = {'uuid':df['uuid'], 'content':df['content'], 'label':df['label']}

    # Load the pretrained sentence encoder
    # vectors = SentenceModel(device='cuda' if torch.cuda.is_available() else 'cpu') # BERT-uncased embeddings
    if 'longformer' in model_name:
        vectors = SentenceModel(model_name_or_path=f"../code/premodel/{model_name}", max_seq_length=648, device='cuda' if torch.cuda.is_available() else 'cpu')
    else:
        vectors = SentenceModel(model_name_or_path=f"../code/premodel/{model_name}", max_seq_length=512, device='cuda' if torch.cuda.is_available() else 'cpu')

    print('Converting text to vectors...')
    for i in tqdm.trange(len(df['content'])):
        sentence = df['content'][i]
        # Encode the sample's text
        sentence_vectors = vectors.encode(sentence)
        df['content'][i] = sentence_vectors
    # Save everything as npy arrays: 1. content vectors 2. labels (train)
    content_vectors = np.array(df['content'])
    label_vectors = np.array(df['label'])
    # Split 5:5 into training and validation sets
    content_vectors1 = np.array_split(content_vectors, 2)[0]
    label_vectors1 = np.array_split(label_vectors, 2)[0]
    content_vectors2 = np.array_split(content_vectors, 2)[1]
    label_vectors2 = np.array_split(label_vectors, 2)[1]
    np.save(f'./embedding/x_train_0.5_{model_index}.npy', content_vectors1)
    np.save(f'./embedding/y_train_0.5_{model_index}.npy', label_vectors1)
    np.save(f'./embedding/x_val_0.5_{model_index}.npy', content_vectors2)
    np.save(f'./embedding/y_val_0.5_{model_index}.npy', label_vectors2)

    print(f'{model_index}.{model_name} done.')

if __name__ == '__main__':
    # model = ['bert-base-uncased', 'bert-large-uncased', 
    #          'deberta-v3-base', 'deberta-v3-large', 
    #          'roberta-base', 'roberta-large', 
    #          'longformer-base-4096']
    # for model_index, model_name in enumerate(model):
    #     process('../data_preprocess/train.csv', model_name, model_index+1, include_keywords=False)
    
    # Separately add 8: albert-xxlarge-v2
    # process('../data_preprocess/train.csv', 'albert-xxlarge-v2', 8, include_keywords=False)
    # Separately add 9: potsawee/longformer-large-4096-answering-race
    # process('../data_preprocess/train.csv', 'longformer-large-4096-answering-race', 9, include_keywords=False)

    # 10 multilingual - e5-large
    # process('../data_preprocess/train.csv', 'multilingual-e5-large', 10, include_keywords=False)
    # 11 multilingual - base
    # process('../data_preprocess/train.csv', 'paraphrase-multilingual-mpnet-base-v2', 11, include_keywords=False)
    # 12 medical RoBERTa
    # process('../data_preprocess/train.csv', 'nuclear_medicine_DARoBERTa', 12, include_keywords=False)
    # 13 multilingual - e5-base
    # process('../data_preprocess/train.csv', 'multilingual-e5-base', 13, include_keywords=False)
    # 14 multilingual - e5-small
    # process('../data_preprocess/train.csv', 'multilingual-e5-small', 14, include_keywords=False)
    # 15 English - e5-large-v2
    process('../data_preprocess/train.csv', 'e5-large-v2', 15, include_keywords=False)

3. Train each model

The training code (the `model_pretrain` and `run` functions) is included in the full script of step 4 below.

4. Combine the models in different ways and evaluate

import numpy  as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn import metrics
import matplotlib.pyplot as plt
import os
import time
import warnings
warnings.filterwarnings('ignore')

# Hyperparameter settings
class opt:
    seed               = 100 # 100
    batch_size         = 64
    set_epoch          = 300 # 20
    early_stop         = 15 # About training process
    consideration_num  = 5 # About model saving
    learning_rate      = 1e-3 #1e-3
    weight_decay       = 2e-6 # L2 regularization 2e-6
    device             = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    gpu_num            = 1
    use_data_au        = False
    use_BCE            = False
    models = ['bert-base-uncased', 'bert-large-uncased', 
            'deberta-v3-base', 'deberta-v3-large', 
            'roberta-base', 'roberta-large', 
            'longformer-base-4096', 'albert-xxlarge-v2', 'longformer-large-4096-answering-race',
            'multilingual-e5-large', 'paraphrase-multilingual-mpnet-base-v2', 'nuclear_medicine_DARoBERTa',
            'multilingual-e5-base', 'multilingual-e5-small', 'e5-large-v2']

# Model definition
class LSTM(nn.Module):
    def __init__(self):
        super(LSTM, self).__init__()
        # LSTM
        hidden_size = 256
        self.lstm1   = nn.LSTM(input_size=768, hidden_size=hidden_size, bidirectional=False)
        self.lstm2   = nn.LSTM(input_size=1024, hidden_size=hidden_size, bidirectional=False)
        self.lstm3   = nn.LSTM(input_size=4096, hidden_size=hidden_size, bidirectional=False)
        self.lstm4   = nn.LSTM(input_size=384, hidden_size=hidden_size, bidirectional=False)
        self.act    = nn.Tanh() #nn.Tanh()
        self.classifier1 = nn.Linear(hidden_size, 2)
        self.classifier2 = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # Pick the LSTM whose input_size matches the incoming embedding
        # dimension (768 / 1024 / 4096 / 384); mismatched ones raise and are skipped
        try:
            x, _ = self.lstm1(x)
        except:
            try:
                x, _ = self.lstm2(x)
            except:
                try:
                    x, _ = self.lstm3(x)
                except:
                    x, _ = self.lstm4(x)
        e = x
        x = self.act(x)
        if not opt.use_BCE:
            x = self.classifier1(x)
        else:
            x = self.classifier2(x)
        return x, e

# 1. Data loading
def load_data(model_index):
    train_data_path     = f'./embedding/x_train_0.5_{model_index}.npy'
    train_label_path    = f'./embedding/y_train_0.5_{model_index}.npy'
    val_data_path      = f'./embedding/x_val_0.5_{model_index}.npy'
    val_label_path      = f'./embedding/y_val_0.5_{model_index}.npy'
    train_data          = torch.tensor(np.load(train_data_path   , allow_pickle=True).tolist()).float()
    train_label         = torch.tensor(np.load(train_label_path  , allow_pickle=True).tolist()).long()
    val_data           = torch.tensor(np.load(val_data_path    , allow_pickle=True).tolist()).float()
    val_label           = torch.tensor(np.load(val_label_path    , allow_pickle=True).tolist()).long()
    train_dataset       = torch.utils.data.TensorDataset(train_data  , train_label)
    val_dataset        = torch.utils.data.TensorDataset(val_data, val_label)
    return train_dataset, val_dataset

# 2. Model training
def model_pretrain(model_index, train_loader, val_loader):
    # Hyperparameter settings
    set_epoch          = opt.set_epoch
    early_stop         = opt.early_stop
    consideration_num  = opt.consideration_num
    learning_rate      = opt.learning_rate
    weight_decay       = opt.weight_decay
    device             = opt.device
    gpu_num            = opt.gpu_num

    # If a best-model checkpoint already exists, load and return it; otherwise train
    if os.path.exists(f'checkpoints/best_model{model_index}.pth'):
        try:
            best_model = LSTM()
            best_model.load_state_dict(torch.load(f'checkpoints/best_model{model_index}.pth'))
            return best_model
        except:
            pass
    else:
        pass
    model_save_dir  = 'checkpoints'
    models = opt.models
    model_name      = models[model_index-1]
    result_save_dir = 'results'

    # Build the model
    model = LSTM().to(device)

    # Optimizer loading
    if device     != 'cpu' and gpu_num > 1:
        optimizer = torch.optim.NAdam(model.module.parameters(), lr=learning_rate, weight_decay=weight_decay)
        optimizer = torch.nn.DataParallel(optimizer, device_ids=list(range(gpu_num)))
    else:
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)

    # Loss function
    if opt.use_BCE:
        loss_func = nn.BCEWithLogitsLoss()
    else:
        loss_func = nn.CrossEntropyLoss()

    # Model training
    # -This training code comes from my GitHub project: Torcheasy - https://github.com/Zeaulo/Torcheasy
    best_epoch         = 0
    best_train_loss    = 100000
    train_acc_list     = []
    train_loss_list    = []
    fixed_val_acc      = None
    best_model_loss    = 100000
    models_list        = [i for i in range(consideration_num)]
    start_time         = time.time()

    for epoch in range(set_epoch):
        model.train()
        train_loss = 0
        train_acc = 0
        for i, (x, y) in enumerate(train_loader):
            x                   = x.to(device)
            y                   = y.to(device)
            outputs, embedding  = model(x)
            
            if opt.use_BCE:
                loss                = loss_func(outputs, y.float().unsqueeze(1))
            else:
                loss                = loss_func(outputs, y)
            train_loss         += loss.item()
            optimizer.zero_grad()
            loss.backward()

            if device != 'cpu' and gpu_num > 1:
                optimizer.module.step()
            else:
                optimizer.step()
            
            if not opt.use_BCE:
                _, predicted = torch.max(outputs.data, 1)
            else:
                predicted = (outputs > 0.5).int()
                predicted = predicted.squeeze(1)
            train_acc   += (predicted == y).sum().item()
            
        average_mode = 'binary'
        train_f1     = metrics.f1_score(y.cpu(), predicted.cpu(), average=average_mode)
        train_pre    = metrics.precision_score(y.cpu(), predicted.cpu(), average=average_mode)
        train_recall = metrics.recall_score(y.cpu(), predicted.cpu(), average=average_mode)

        train_loss /= len(train_loader)
        train_acc  /= len(train_loader.dataset)
        train_acc_list.append(train_acc)
        train_loss_list.append(train_loss)

        print('-'*50)
        print('Epoch [{}/{}]\n Train Loss: {:.4f}, Train Acc: {:.4f}'.format(epoch + 1, set_epoch, train_loss, train_acc))
        print('Train-f1: {:.4f}, Train-precision: {:.4f} Train-recall: {:.4f}'.format(train_f1, train_pre, train_recall))

        # evaluate val
        model.eval()
        with torch.no_grad():
            val_acc = 0
            val_loss = 0
            for i, (x, y) in enumerate(val_loader):
                x = x.to(device)
                y = y.to(device)
                outputs, embedding = model(x)
                if opt.use_BCE:
                    loss = loss_func(outputs, y.float().unsqueeze(1))
                else:
                    loss = loss_func(outputs, y)
                val_loss += loss.item()
                if not opt.use_BCE:
                    _, predicted = torch.max(outputs.data, 1)
                else:
                    predicted = (outputs > 0.5).int()
                    predicted = predicted.squeeze(1)
                val_acc += (predicted == y).sum().item()
            val_loss /= len(val_loader)
            val_acc  /= len(val_loader.dataset)
            print('Val Loss: {:.4f}, Val Acc: {:.4f}'.format(val_loss, val_acc))

        # Choose the best model for saving
        # (epoch+1)%save_num -> Replace the old model in the list
        models_list[(epoch+1)%consideration_num] = model
        if epoch+1 >= consideration_num:
            # -save_num:-1 -> The model score list is always in the last few numbers
            models_list_loss = train_loss_list[-consideration_num:-1]
            # model_loss -> model_loss is the average loss of the model list
            # The purpose is to find the most stable model list
            model_loss = np.mean(models_list_loss)
            if model_loss <= best_model_loss:
                best_model_loss = model_loss
                # Choose the best model in the model list -> The best model in the most stable model list
                perfect = np.argmin(models_list_loss)
                print(f'-> best_model_loss {model_loss} has accessed, saving model...')
                fixed_val_acc = val_acc
                if device == 'cuda' and gpu_num > 1:
                    torch.save(models_list[perfect].module.state_dict(), f'{model_save_dir}/best_model{model_index}.pth')
                else:
                    torch.save(models_list[perfect].state_dict(), f'{model_save_dir}/best_model{model_index}.pth')

        if train_loss < best_train_loss:
            best_train_loss = train_loss
            best_epoch = epoch + 1
        
        # Early stopping
        if epoch+1 - best_epoch == early_stop:
            print(f'{early_stop} epochs later, the loss of the validation set no longer continues to decrease, so the training is stopped early.')
            end_time = time.time()
            print(f'Total time is {end_time - start_time}s.')
            # write fixed_val_acc to result
            with open(f'{result_save_dir}/{model_name}_result.txt', 'a') as f:
                f.write(f'{model_name} {model_index} {fixed_val_acc}\n')
            break

        # Draw the accuracy and loss function curves of the training set and the validation set
        plt.figure()
        plt.plot(range(1, len(train_acc_list) + 1), train_acc_list, label='train_acc')
        plt.legend()
        plt.savefig(f'{result_save_dir}/{model_name}_acc.png')
        plt.close()
        plt.figure()
        plt.plot(range(1, len(train_loss_list) + 1), train_loss_list, label='train_loss')
        plt.legend()
        plt.savefig(f'{result_save_dir}/{model_name}_loss.png')
        plt.close()
    best_model = LSTM()
    best_model.load_state_dict(torch.load(f'checkpoints/best_model{model_index}.pth'))
    return best_model

# 3. Model inference
def model_predict(model, model_index, train_loader, val_loader):
    model.to('cpu')
    y_train_labels   = []
    train_embedding = []
    val_embedding  = []
    val_outputs    = None
    for i, (data, label) in enumerate(train_loader):
        data = data.to('cpu')
        _, embedding = model(data)
        if i == 0:
            train_embedding = embedding
            y_train_labels  = label
        else:
            train_embedding = torch.cat([train_embedding, embedding], dim=0)
            y_train_labels  = torch.cat([y_train_labels, label], dim=0)
    train_embedding = train_embedding.detach().numpy()

    for i, data in enumerate(val_loader):
        data = data[0].to('cpu')
        outputs, embedding = model(data)
        if i == 0:
            val_embedding = embedding
            val_outputs = outputs
        else:
            val_embedding = torch.cat([val_embedding, embedding], dim=0)
            val_outputs = torch.cat([val_outputs, outputs], dim=0)
    val_embedding = val_embedding.detach().numpy()
    # Save val_outputs (the predicted class probabilities)
    if not opt.use_BCE:
        val_outputs = torch.softmax(val_outputs, dim=1)
    torch.save(val_outputs, f'./pred_prob/{model_index}_prob.pth')

def run(model_index):
    # Set the random seeds
    seed = opt.seed
    torch.seed = seed
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

    train_dataset, val_dataset = load_data(model_index)
    # Print dataset info
    print('-Dataset info:')
    print(f'-Training samples: {len(train_dataset)}, validation samples: {len(val_dataset)}')
    train_labels = len(set(train_dataset.tensors[1].numpy()))
    # Check whether the classes are balanced
    print(f'-Number of label classes in the training set: {train_labels}')
    numbers = [0] * train_labels
    for i in train_dataset.tensors[1].numpy():
        numbers[i] += 1
    print(f'-Sample count per class in the training set:')
    for i in range(train_labels):
        print(f'-Class {i}: {numbers[i]} samples')

    batch_size         = opt.batch_size
    # DataLoaders
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
    val_loader  = torch.utils.data.DataLoader(dataset=val_dataset, batch_size=batch_size, shuffle=False)
    val_Predict_loader  = torch.utils.data.DataLoader(dataset=val_dataset, batch_size=1, shuffle=False)

    best_model = model_pretrain(model_index, train_loader, val_loader)
    # Evaluate the model on the validation set
    model_predict(best_model, model_index, train_loader, val_Predict_loader)

if __name__ == '__main__':
    # for i in range(1, 8):
    #     run(i)
    
    # Train individually
    # run(8)
    # run(9)
    # run(10)
    # run(11)
    # run(13)
    # run(14)
    run(15)

Final results (top 45 shown):

My first A-leaderboard submission ensembled models 1-7, which ranks 45th in the figure above; the best-performing combination was (2, 3, 4, 5), i.e. the ensemble of bert-large-uncased_embedding+LSTM, deberta-v3-base_embedding+LSTM, deberta-v3-large_embedding+LSTM, and roberta-base_embedding+LSTM.
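The combination search itself is simple once each model's softmax probabilities are saved (the script above writes them to `./pred_prob/{i}_prob.pth`): average the members' class probabilities, take the argmax, and score against the validation labels. A minimal sketch on toy tensors:

```python
import torch

# Toy stand-ins for the saved per-model probability tensors
# (the real ones come from ./pred_prob/{i}_prob.pth): 4 samples, 2 classes
probs = {
    2: torch.tensor([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]),
    3: torch.tensor([[0.3, 0.7], [0.1, 0.9], [0.8, 0.2], [0.1, 0.9]]),
}
labels = torch.tensor([0, 1, 0, 1])

def ensemble_acc(member_indices):
    """Soft voting: average the members' probabilities, then argmax."""
    avg = torch.stack([probs[i] for i in member_indices]).mean(dim=0)
    preds = avg.argmax(dim=1)
    return (preds == labels).float().mean().item()

# Model 3 alone errs on the first sample; the two-model average corrects it
solo = ensemble_acc([3])
duo  = ensemble_acc([2, 3])
```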

2. Keyword extraction

I had never done keyword extraction before, and after the B leaderboard opened I spent a full day (the 24th) tinkering with the classification ensemble, securing 0.369+ on the classification task — which left very little time for keyword extraction. As the title says, I really only had two days on the B leaderboard. On the second day (the 25th) I turned to keyword extraction, trying TF-IDF, TextRank, and YAKE; YAKE performed best, scoring 0.43459.

Code:

import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
import yake

# Read testB.csv
testB = pd.read_csv('../data_preprocess/testB.csv', header=0, delimiter=',',
 names=['uuid','title','author','abstract'])
# Read the submission file
submit = pd.read_csv('../submit/Ensemble_LSTM_submitB.csv', header=0, delimiter=',',
 names=['uuid','Keywords','label'])

# Load the extractor
kw_extractor = yake.KeywordExtractor(n=4, top=40, dedupLim=1)

# Merge title and abstract
testB['title_abstract'] = testB['title'] + testB['abstract']

# Iterate over the texts and extract keywords
all_lines_keywords = []
for text in tqdm(testB['title_abstract']): 
    keywords = kw_extractor.extract_keywords(text)
    keywords = [keyword[0] for keyword in keywords]
    keywords = '; '.join(keywords)
    all_lines_keywords.append(keywords)

# Fill the extracted keywords into the submission
for i in range(len(submit['Keywords'])):
    submit['Keywords'][i] = all_lines_keywords[i]

# Save
submit.to_csv('Ensemble_LSTM_yake.csv',index=False)

Submission result:

At this point I ranked fourth, but I had used up my submission quota.

Next I planned to keep improving the score with a text-generation approach.

After renting an NVIDIA 3090, I spent an afternoon plus an evening setting up the environment and getting the code working.

But given the time needed for LoRA fine-tuning and for running inference on 2000 samples, I reluctantly had to stop there for the official competition.

After the official competition ended, I tried ChatGLM2 (keywords for the first 1200 samples) + YAKE (keywords for the remaining 800 samples) and submitted: score 0.51063.

Using ChatGLM2 to extract keywords for all samples and submitting: score 0.564.

III. Takeaways from the competition

1. My first summer camp of this kind; I met many talented people.

2. My first time spending only seven days on a competition (7/20-7/26); I'm pleased with what I achieved in so little time.

3. My first time implementing the model ensemble I wanted in code myself, and it delivered the results I hoped for.

4. My first keyword extraction task, learning a variety of methods:

① TF-IDF + embeddings

② TextRank

③ YAKE

5. My first time using large-language-model fine-tuning in a competition.

6. Picked up more score-boosting tricks:

① k-fold cross-validation

② model ensembling

7. Gained more NLP competition experience:

① prepare for the competition early and be able to validate model performance locally

② when it is unclear whether the test and training distributions drift, split train : validation = 5 : 5; the validation set must be large enough for model evaluation to be reliable

③ when large-model fine-tuning is needed, budget extra time

-> The content above is adapted from my notes for the Datawhale summer camp: Datawhale AI夏令营 - NLP实践:基于论文摘要的文本分类与关键词抽取挑战赛——两天极限打B榜 - 飞书云文档 (feishu.cn)


And with that —

thank you for reading to the end.
