Chinese Cross-Topic Authorship Attribution (CCTAA) Corpus
Overview
The CCTAA corpus is a Chinese authorship attribution testbed for contemporary Mandarin prose. CCTAA encourages models which focus narrowly on topic-independent writing style and supports reproducible research.
The CCTAA corpus contains single-author newswire articles written in simplified Chinese characters by 500 reporters affiliated with the Xinhua News Agency. Each author appears in all three splits, upholding the closed-set assumption of many authorship attribution models. In the training set, every author contributes multiple passages, each consisting of one or more paragraphs and together totalling no fewer than 5,000 characters. Each author has exactly one sample in the validation set and one in the test set. Samples in these two sets have more than 400 characters.
We carefully control topics across the train, validation (dev), and test sets so that an author's training topic(s) never appear in that author's dev or test samples. An author's validation and test samples may or may not share a topic. See the table below for statistics.
| Split | No. Authors | Characters per Author (s.d.) | Passages per Author (s.d.) |
|---|---|---|---|
| Train | 500 | 5305 (247) | 11 (2) |
| Validation | 500 | 460 (208) | 1 (0) |
| Test | 500 | 471 (226) | 1 (0) |
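The split constraints described above (every author in all three splits, exactly one validation and one test sample per author, and no training topic reappearing in that author's held-out samples) can be checked mechanically. The following is a minimal sketch, not part of the repository: it assumes each sample is available as a hypothetical `(author, split, topic)` tuple with split names `train`, `dev`, and `test`, which may not match the actual columns of the generated CSV.

```python
from collections import defaultdict

HELD_OUT_SPLITS = ("dev", "test")  # assumed split names, not taken from the CSV


def check_splits(rows):
    """Return a list of constraint violations for (author, split, topic) rows."""
    samples = defaultdict(lambda: defaultdict(list))
    for author, split, topic in rows:
        samples[author][split].append(topic)

    problems = []
    for author, by_split in samples.items():
        train_topics = set(by_split.get("train", []))
        if not train_topics:
            problems.append(f"{author}: no training passages")
        for split in HELD_OUT_SPLITS:
            held_out = by_split.get(split, [])
            # Each author should have exactly one dev and one test sample.
            if len(held_out) != 1:
                problems.append(
                    f"{author}: expected 1 {split} sample, found {len(held_out)}"
                )
            # Training topics must not reappear in the held-out samples.
            leaked = train_topics & set(held_out)
            if leaked:
                problems.append(
                    f"{author}: training topic(s) {sorted(leaked)} reappear in {split}"
                )
    return problems


# Toy data: author "a" satisfies the constraints, author "b" leaks a topic.
toy_rows = [
    ("a", "train", "sports"), ("a", "train", "sports"),
    ("a", "dev", "finance"), ("a", "test", "finance"),
    ("b", "train", "politics"),
    ("b", "dev", "politics"), ("b", "test", "culture"),
]
for problem in check_splits(toy_rows):
    print(problem)
```

Note that the dev and test samples of one author are allowed to share a topic, so the sketch only compares each held-out split against the training topics.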
We refer you to our paper [TODO] for details.
Generating CCTAA
We cannot distribute CCTAA as per the regulations of the LDC.
Instead, we distribute a script (generate_cctaa.py) to help users generate CCTAA from the LDC Chinese Gigaword Second Edition.
Assuming Python 3.8 is installed, follow the procedure below (or an analogous one for your setup).
Please put cctaa-v1.0.0 under the same parent directory as chinese-gigaword-2e.
Keep chinese-gigaword-2e intact.
# create a new venv, activate venv, and install needed packages
$ python3 -m venv .
$ source bin/activate
$ pip3 install -r requirements.txt
$ python3 generate_cctaa.py
The CCTAA corpus (cctaa-v1.0.0.csv) will appear within the folder once the progress bar completes.
A warning should be issued if the newly compiled corpus fails an MD5 check (expected digest: 2e2a54811f59944968c6929b5ec891e7).
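If you want to verify the digest yourself, the check can be sketched with the standard library alone. This is an illustration rather than the code used in generate_cctaa.py; the function names are hypothetical, and the file path in the usage note assumes the default output location.

```python
import hashlib
from pathlib import Path

# Digest quoted in this README for cctaa-v1.0.0.csv.
EXPECTED_MD5 = "2e2a54811f59944968c6929b5ec891e7"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected one."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("cctaa-v1.0.0.csv")` should return `True` for a correctly generated corpus, assuming the file sits in the current directory.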
The input and output directories can be specified with -i and -o; run python3 generate_cctaa.py -h for details.
You may want to remove all whitespace and direct quotations before any further processing (e.g., with re.sub(r"\s|“[\u4E00-\u9FFF,。《》\(\);:‘’\!\?\s.]+?”", "", sometext)).
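The cleaning step can be wrapped in a small helper. The sketch below copies the pattern from this README as-is and only adds a raw-string prefix and precompilation; the helper name `clean` is hypothetical. A quotation is removed only when everything between the curly double quotes is a Chinese character, whitespace, or one of the punctuation marks listed in the pattern.

```python
import re

# Pattern taken from the README: either a single whitespace character, or a
# curly-quoted run of Chinese characters and the listed punctuation marks.
CLEAN_PATTERN = re.compile(r"\s|“[\u4E00-\u9FFF,。《》\(\);:‘’\!\?\s.]+?”")


def clean(text):
    """Strip whitespace and direct quotations matched by CLEAN_PATTERN."""
    return CLEAN_PATTERN.sub("", text)


# The quoted sentence and the remaining space are both removed.
print(clean("他说“今天 很好。”然后 离开了"))
```

Quotations containing characters outside the listed set (Latin letters or digits, for instance) are left in place apart from their whitespace, so you may need to extend the character class for your own data.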
Baselines
SVM
We provide a linear SVM (sklearn.svm.SVC(kernel='linear', C=1)) baseline that learns from Chinese function character n-gram frequencies.
The frequencies are counted with functionwords.FunctionWords('chinese_simplified_modern') after removing all whitespace and direct quotations.
The SVM achieves 3.0% accuracy on the test set.
All required packages and their versions are specified in requirements.txt.
Run the following command to reproduce the baseline.
$ python3 calculate_svm_baseline.py
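To give a rough idea of the features involved, the sketch below counts occurrences of function character n-grams in a passage. It is not the code in calculate_svm_baseline.py and does not use the functionwords package: the tiny vocabulary is a hypothetical stand-in for the package's list, and the optional length normalization is an assumption rather than a documented step of the baseline.

```python
# Hypothetical, illustrative vocabulary; the actual baseline relies on the
# list shipped with the functionwords package.
FUNCTION_NGRAMS = ["的", "了", "在", "和", "但是", "因为"]


def function_ngram_counts(text, vocabulary=FUNCTION_NGRAMS, normalize=False):
    """Count non-overlapping occurrences of each vocabulary item in text.

    With normalize=True, counts are divided by the passage length in
    characters (an assumption made for this sketch).
    """
    counts = [text.count(item) for item in vocabulary]
    if not normalize:
        return counts
    total = max(len(text), 1)  # avoid dividing by zero on empty input
    return [count / total for count in counts]


passage = "我的书在桌子上,但是你的书不在"
print(function_ngram_counts(passage))
```

Vectors of this kind, one per passage, could then be passed to a linear classifier such as the SVC configuration named above.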
RoBERTa
A RoBERTa baseline is also included. Due to the complexity of reproducing a deep learning model, we provide the script (calculate_roberta_baseline.py) as well as its performance monitored with wandb, where all the (hyper)parameters can be found.
The required packages are not listed in requirements.txt, but they and their versions can be found in the script.
License
All materials are licensed under the ISC License.
Contact
Contact the repo maintainer for questions and bugs.
Paper
@inproceedings{wang2022cctaa,
title={CCTAA: A Reproducible Corpus for Chinese Authorship Attribution Research},
author={Wang, Haining and Riddell, Allen},
booktitle={Proceedings of the 13th Language Resources and Evaluation Conference},
year={2022}
}