The WARC Extractor depends on the warcat library and pandoc. You can install these dependencies as follows:
pip3 install warcat
sudo apt install pandoc
To run the WARC Extractor, use the provided shell script. The script takes the path to a folder containing WARC files as an argument.
bash run.sh WARC_FOLDER_PATH
For example, if your WARC files are located in the ACM directory, you can run:
bash run.sh ACM
This will process the WARC files in the ACM folder, such as ACM/2020-05_ACM_0001.warc.gz.
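Before starting a long run, it can help to confirm that the required tools are on the PATH and that the folder actually contains WARC files. The following is a minimal sketch, not part of the repository: `require_cmd` and `count_warcs` are hypothetical helper names, and the check assumes the WARC files sit directly inside the folder and end in `.warc` or `.warc.gz`.

```shell
# Hypothetical pre-flight helpers; not part of the WARC Extractor itself.

# Succeed if the named command is on the PATH, otherwise report it and fail.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || { echo "missing: $1" >&2; return 1; }
}

# Print how many *.warc and *.warc.gz files sit directly inside the folder.
# Assumption: WARC files are not nested in subfolders.
count_warcs() {
  n=0
  for f in "$1"/*.warc "$1"/*.warc.gz; do
    # An unmatched glob stays literal, so only count real files.
    if [ -f "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Example use (commented out so the helpers can be sourced safely):
#   require_cmd pandoc && require_cmd python3 || exit 1
#   [ "$(count_warcs ACM)" -gt 0 ] || { echo "no WARC files in ACM" >&2; exit 1; }
#   bash run.sh ACM
```

These checks only look for file names and commands; they do not verify that the warcat Python package is importable or that the WARC files are valid.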
The WARC Extractor generates two types of output, which are saved in separate directories:
- Extracted Data: This directory contains data extracted directly from the WARC files.
  - Directory: extracted_data
- Flatten Data: This directory contains data that has been reprocessed to reduce redundant folders and paths.
  - Directory: flatten_data
To extract HTML content from the extracted data, you can use the convert_html.sh script. The script requires the path to the HTML file as an argument.
bash convert_html.sh $html_path
Replace $html_path with the path to your HTML file. For example:
bash convert_html.sh extracted_data/example.html
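Since convert_html.sh takes one HTML file per call, converting a whole output directory means invoking it once per file. The sketch below shows one way to do that with `find`. `convert_all_html` is a hypothetical helper name, and the sketch assumes that the HTML files end in `.html` and that the script path is given relative to the current directory.

```shell
# Hypothetical batch helper; not part of the WARC Extractor itself.
# Usage: convert_all_html ROOT_DIR [CONVERTER_SCRIPT]
convert_all_html() {
  root="$1"
  # Default to the repository's script; assumed to be in the current directory.
  converter="${2:-convert_html.sh}"
  # Run the converter once per *.html file found anywhere under ROOT_DIR.
  find "$root" -type f -name '*.html' -exec bash "$converter" {} \;
}
```

Calling `convert_all_html extracted_data` from the repository root would run the script on every `.html` file under extracted_data, including those in nested subfolders. Files with other extensions, such as `.htm`, are skipped under this assumption.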