dataset_mention_extraction

This repo contains code for extracting dataset mentions from scientific text.

Prepare Data:

1. Download the Kaggle dataset from https://www.kaggle.com/competitions/coleridgeinitiative-show-us-the-data/data and save it in the data folder. You should have the train folder and the train.csv file directly under the data folder.
2. From the data_wrangling module, use the "process_kaggle_data_with_id" function to extract the contexts and generate the "all_samples.csv" file. This module also contains a function "" that counts the frequent words in positive and negative contexts. The frequent words are used to create a list of five questions.
3. Use the create_folds module to create the five folds that will be used in the experiments.
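The two ideas in the steps above, counting frequent words per class and splitting samples into five folds, can be sketched roughly as follows. This is only an illustration using the standard library: the helper names frequent_words and assign_folds, the whitespace tokenization, and the round-robin stratification are assumptions, not the code in data_wrangling.py or create_folds.py.

```python
import random
from collections import Counter, defaultdict


def frequent_words(contexts, top_k=5):
    """Return the top_k most common lowercase tokens across the contexts."""
    counts = Counter()
    for text in contexts:
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(top_k)]


def assign_folds(labels, n_folds=5, seed=0):
    """Give each sample a fold index so every fold has a similar label ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [None] * len(labels)
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this label out to the folds in turn.
        for position, idx in enumerate(indices):
            folds[idx] = position % n_folds
    return folds


# Hypothetical toy data: 1 = context mentions a dataset, 0 = it does not.
positive_contexts = ["the survey data were collected", "data from the survey"]
top_positive_words = frequent_words(positive_contexts, top_k=3)

labels = [1] * 10 + [0] * 10
fold_ids = assign_folds(labels, n_folds=5)
```

Comparing the frequent words of the positive contexts with those of the negative contexts is one way to pick vocabulary for the questions mentioned in step 2.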

Classification Experiments:
4. To get the classification results for a language model, run the cls_exp.py module. It generates a log file that contains the results for each fold using the language model whose checkpoint is given. It also stores the best-performing model for each module.
5. BERT_MLP2_BFL.py uses MLP2 as the classification head for BERT.
6. The mlp_exp.py module can be used to generate MLP-2 results on BERT-mean or TF-IDF.
7. To use custom tokenization, the custom_tokenizer.py module can be used. It generates a trained tokenizer that you can use to replace or modify the original tokenizer of a language model.
8. The xgb_exp.py module can be used to classify contexts using XGBoost on BERT-mean and TF-IDF.
9. ensemble_meta_model_exp.py stacks an ensemble of three MLP-2 detectors, each with different settings. First run the mlp_exp module with the different settings to create the three models, then use ensemble_meta_model_exp to combine them.
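To illustrate the stacking idea behind the ensemble step, here is a minimal sketch in which a small logistic-regression meta-model is fitted on the probabilities produced by three base detectors. It is an assumption-laden toy: the function names, the hand-written gradient descent, and the example probabilities are all hypothetical, and the actual meta-model in ensemble_meta_model_exp.py may be quite different.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_model(base_probs, labels, lr=0.5, epochs=500):
    """Fit logistic-regression weights on base-detector probabilities."""
    n_features = len(base_probs[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, y in zip(base_probs, labels):
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
            err = p - y  # gradient of the log loss w.r.t. the logit
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, row):
    """Return the stacked probability for one row of base-detector outputs."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


# Hypothetical held-out probabilities from three base detectors (one column each).
base_probs = [
    [0.9, 0.8, 0.7],
    [0.8, 0.6, 0.9],
    [0.2, 0.3, 0.1],
    [0.1, 0.4, 0.2],
    [0.7, 0.9, 0.6],
    [0.3, 0.2, 0.4],
]
labels = [1, 1, 0, 0, 1, 0]

weights, bias = fit_meta_model(base_probs, labels)
high = predict(weights, bias, [0.85, 0.9, 0.8])
low = predict(weights, bias, [0.1, 0.2, 0.15])
```

In a stacking setup the meta-model is usually fitted on predictions for samples the base models did not train on, so that it does not simply learn to trust overfitted outputs.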

NER Experiments:
10. The ner_exp.py module is used to generate NER results using language models.
11. The ner_space.py module is used to generate NER results using spaCy. For that, you need the configs folder, which is provided in the code, and you need to download the spaCy model with the command python -m spacy download en_core_web_md. Then you can use the code.
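NER training data is commonly expressed as one tag per token. The sketch below shows one possible way to turn a known dataset mention into BIO tags; the bio_tags helper, the DATASET label name, and the whitespace tokenization are assumptions for illustration and are not taken from ner_exp.py.

```python
def bio_tags(tokens, mention_tokens):
    """Tag each occurrence of mention_tokens with B-/I-DATASET, everything else O."""
    tags = ["O"] * len(tokens)
    m = len(mention_tokens)
    if m == 0:
        return tags
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in mention_tokens]
    i = 0
    while i <= len(tokens) - m:
        if lowered[i:i + m] == target:
            tags[i] = "B-DATASET"
            for j in range(i + 1, i + m):
                tags[j] = "I-DATASET"
            i += m  # skip past the matched mention
        else:
            i += 1
    return tags


# Hypothetical example context and mention.
tokens = "We use the National Education Longitudinal Study data".split()
mention = "National Education Longitudinal Study".split()
tags = bio_tags(tokens, mention)
```

Matching is case-insensitive here, which is a design choice of this sketch; exact matching would miss mentions whose casing differs from the label.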

QA Experiments:
12. The qa_exp.py module can be used to run the question answering experiments.
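For extractive question answering, each context is typically paired with a question and the character offset of the answer. The following sketch builds such a record in a SQuAD-like layout; the build_qa_example helper, the record layout, and the example question are hypothetical and are not the five questions or the data format used by qa_exp.py.

```python
def build_qa_example(question, context, mention):
    """Pair a question with a context and locate the mention as the answer span."""
    start = context.find(mention) if mention else -1
    if start == -1:
        # No mention in this context: an unanswerable example.
        answers = {"text": [], "answer_start": []}
    else:
        answers = {"text": [mention], "answer_start": [start]}
    return {"question": question, "context": context, "answers": answers}


# Hypothetical question and context.
question = "Which dataset is used?"
context = "Data come from the Baltimore Longitudinal Study of Aging."
mention = "Baltimore Longitudinal Study of Aging"

example = build_qa_example(question, context, mention)
negative = build_qa_example(question, "The weather was pleasant.", "")
```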

Notebooks:
13. svm_cls_exp.ipynb: this notebook uses PCA and t-SNE outputs as input to an SVM for the detection task.
14. MLP2-analysis.ipynb: this notebook applies MLP-2 on the folds and stores the results as a dictionary in a text file. It is also used to analyze the detector results on fold 0.
15. pipe_and_qa_analysis.ipynb: this notebook is used to pipe the detector and the extractor. It is also used to analyze the pipeline performance when DeBERTa with q3 is used on fold 0.
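Piping a detector into an extractor can be pictured as a two-stage filter: the extractor only runs on contexts the detector flags. The sketch below uses toy stand-ins for both stages; run_pipeline, toy_detector, toy_extractor, and the 0.5 threshold are assumptions made for illustration, not the models or logic in pipe_and_qa_analysis.ipynb.

```python
import re


def run_pipeline(contexts, detector, extractor, threshold=0.5):
    """Run the extractor only on contexts the detector scores at or above threshold."""
    results = []
    for context in contexts:
        if detector(context) >= threshold:
            results.append((context, extractor(context)))
        else:
            results.append((context, None))
    return results


def toy_detector(context):
    """Stand-in detector: flags contexts that contain the word 'data'."""
    return 1.0 if "data" in context.lower() else 0.0


def toy_extractor(context):
    """Stand-in extractor: capitalized words ending in 'Survey' or 'Study'."""
    match = re.search(r"(?:[A-Z][a-z]+ )+(?:Survey|Study)", context)
    return match.group(0) if match else None


contexts = [
    "We analyze data from the National Health Survey.",
    "The weather was pleasant.",
]
results = run_pipeline(contexts, toy_detector, toy_extractor)
```

Running the cheaper detector first means the extractor is never asked about contexts that are unlikely to contain a mention, which can reduce both cost and false positives.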


akastrin/dataset-mention-extraction
