
Chinese Cross-Topic Authorship Attribution (CCTAA) Corpus

Overview

The CCTAA corpus is a Chinese authorship attribution testbed for contemporary Mandarin prose. It encourages models that focus narrowly on topic-independent writing style and supports reproducible research.

The CCTAA corpus contains single-author newswire articles in simplified Chinese from 500 reporters affiliated with the Xinhua News Agency. Each author appears in all three splits, upholding the closed-set assumption of many authorship attribution models. For training, every author contributes multiple passages, each consisting of one or more paragraphs, totaling no fewer than 5,000 characters. Each author has exactly one sample in the validation set and one in the test set, and each of these examples has more than 400 characters.

We carefully control topics across the train, validation, and test sets, so that an author's training topics never appear in that author's validation or test examples. The topic of an author's validation example may or may not match the topic of their test example. See the table below for statistics.

Split        No. Authors  Characters per Author (s.d.)  Passages per Author (s.d.)
Train        500          5305 (247)                    11 (2)
Validation   500          460 (208)                     1 (0)
Test         500          471 (226)                     1 (0)
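The topic-control constraint above can be checked mechanically. A minimal sketch, assuming per-example records of the form (author, split, topic) — the records and topic labels here are hypothetical, not the corpus schema:

```python
from collections import defaultdict

# Hypothetical (author, split, topic) records for illustration only.
records = [
    ("a1", "train", "sports"), ("a1", "train", "finance"),
    ("a1", "dev", "politics"), ("a1", "test", "politics"),
]

# Collect each author's training topics and evaluation (dev/test) topics.
train_topics = defaultdict(set)
eval_topics = defaultdict(set)
for author, split, topic in records:
    (train_topics if split == "train" else eval_topics)[author].add(topic)

# No author's training topic may recur among their dev/test examples.
ok = all(train_topics[a].isdisjoint(eval_topics[a]) for a in train_topics)
print(ok)  # True for this toy example
```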

We refer you to our paper [TODO] for details.

Generating CCTAA

We cannot distribute CCTAA directly, per LDC regulations. Instead, we distribute a script (generate_cctaa.py) that helps users generate CCTAA from the LDC Chinese Gigaword Second Edition.

Assuming Python 3.8 is installed, follow the procedure below (adapting it to your setup as needed). Place cctaa-v1.0.0 under the same parent directory as chinese-gigaword-2e, and keep chinese-gigaword-2e intact.

# create a new venv, activate venv, and install needed packages
$ python3 -m venv .
$ source bin/activate
$ pip3 install -r requirements.txt
$ python3 generate_cctaa.py

The CCTAA corpus (cctaa-v1.0.0.csv) will appear in the folder after the progress bar completes. A warning is issued if the newly compiled corpus does not pass an MD5 check (2e2a54811f59944968c6929b5ec891e7). The input and output directories can be specified with -i and -o; run python3 generate_cctaa.py -h for details.
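If you want to verify the checksum yourself, the MD5 digest can be computed with the standard library. A minimal sketch (cctaa-v1.0.0.csv is the output file produced by the step above):

```python
import hashlib

EXPECTED_MD5 = "2e2a54811f59944968c6929b5ec891e7"

def md5_of(path: str) -> str:
    """Compute the MD5 hex digest of a file, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g. md5_of("cctaa-v1.0.0.csv") == EXPECTED_MD5
```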

You may want to strip all spaces and direct quotations before any further processing (e.g., with re.sub("\s|“[\u4E00-\u9FFF,。《》\(\);:‘’\!\?\s.]+?”", "", sometext)).
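For instance, the cleanup above can be wrapped in a small helper (a sketch; the sample sentence is purely illustrative):

```python
import re

# Pattern from above: removes all whitespace and any direct quotation
# wrapped in curly quotes whose content is CJK text or common punctuation.
PATTERN = re.compile(r"\s|“[\u4E00-\u9FFF,。《》\(\);:‘’\!\?\s.]+?”")

def clean(text: str) -> str:
    return PATTERN.sub("", text)

print(clean("他 说：“今天天气很好。” 然后离开了。"))  # 他说：然后离开了。
```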

Baselines

SVM

We provide a linear SVM (sklearn.svm.SVC(kernel='linear', C=1)) baseline that learns from Chinese function-character n-gram frequencies. The frequencies are counted with functionwords.FunctionWords('chinese_simplified_modern') after stripping all spaces and direct quotations.
The SVM achieves 3.0% accuracy on the test set. All required packages and their versions are specified in requirements.txt. Run the following command to reproduce the baseline.

$ python3 calculate_svm_baseline.py
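The core of the baseline can be sketched with scikit-learn alone. This is a minimal illustration, not the full script: the function-character list below is a tiny hypothetical stand-in for the full functionwords inventory, and the two toy "authors" exist only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical miniature function-character inventory; the real baseline
# uses the full functionwords 'chinese_simplified_modern' list.
FUNCTION_CHARS = ["的", "了", "是", "在", "和", "就", "不", "都"]

# Count per-character frequencies restricted to the fixed vocabulary.
vectorizer = CountVectorizer(vocabulary=FUNCTION_CHARS, analyzer="char")
X_train = vectorizer.transform(["我的书在桌子上", "他就是不去"])
y_train = ["author_a", "author_b"]

# Linear SVM as in the baseline: SVC(kernel='linear', C=1).
clf = SVC(kernel="linear", C=1)
clf.fit(X_train, y_train)

print(clf.predict(vectorizer.transform(["我的笔在这里"])))  # ['author_a']
```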

RoBERTa

A RoBERTa baseline is also included. Given the complexity of reproducing a deep learning model, we provide the script (calculate_roberta_baseline.py) together with its performance as monitored with wandb, where all the (hyper)parameters can be found. The packages it depends on are not included in requirements.txt, but the package and version information can be found in the script.

License

All materials are licensed under the ISC License.

Contact

Contact the repo maintainer for questions and bugs.

Paper

@inproceedings{wang2022cctaa,
  title={CCTAA: A Reproducible Corpus for Chinese Authorship Attribution Research},
  author={Wang, Haining and Riddell, Allen},
  booktitle={Proceedings of the 13th Language Resources and Evaluation Conference},
  year={2022}
}