WARC Extractor

Dependencies

The WARC Extractor depends on the warcat Python library and the pandoc command-line tool. Install warcat with pip:

pip3 install warcat

Install `pandoc`

sudo apt install pandoc
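
Before running the extractor, it can help to confirm that both dependencies are visible from your environment. The following is a minimal sketch, not part of the repository: the function name check_dependencies is hypothetical, and it only checks that warcat is importable and that a pandoc executable is on PATH.

```python
import importlib.util
import shutil


def check_dependencies():
    """Return a dict mapping each dependency name to whether it was found."""
    return {
        # warcat is a Python package, so look it up on the import path.
        "warcat": importlib.util.find_spec("warcat") is not None,
        # pandoc is a command-line tool, so look it up on PATH.
        "pandoc": shutil.which("pandoc") is not None,
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If either entry is reported as missing, rerun the corresponding install command above.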

Usage

To run the WARC Extractor, use the provided shell script. The script takes the path to a folder containing WARC files as an argument.

Running the Script

bash run.sh WARC_FOLDER_PATH

For example, if your WARC files are located in the ACM directory, you can run:

bash run.sh ACM

This will process the WARC files in the ACM folder, such as ACM/2020-05_ACM_0001.warc.gz.
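
To preview which files a folder contributes before running the script, something like the sketch below may be useful. It assumes that the WARC files sit directly inside the folder and end in .warc or .warc.gz; how run.sh actually selects files is not documented here, and find_warc_files is a hypothetical helper name.

```python
from pathlib import Path


def find_warc_files(folder):
    """Return the sorted .warc and .warc.gz files directly inside folder."""
    root = Path(folder)
    return sorted(
        p
        for p in root.iterdir()
        # Assumption: only these two suffixes count as WARC files.
        if p.is_file() and p.name.endswith((".warc", ".warc.gz"))
    )
```

For example, find_warc_files("ACM") would list ACM/2020-05_ACM_0001.warc.gz if that file exists.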

Output

The WARC Extractor generates two types of output, which are saved in separate directories:

- Extracted Data (directory: extracted_data): data extracted directly from the WARC files.
- Flattened Data (directory: flatten_data): the extracted data, reorganized to remove redundant folders and path levels.
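
To illustrate what flattening a directory tree can mean, here is one simple approach: copy every file into a single directory and encode its original relative path in the file name. This is only a sketch under that assumption. The repository's own flattening logic may work differently, and flatten_tree is a hypothetical name.

```python
import shutil
from pathlib import Path


def flatten_tree(src, dst):
    """Copy every file under src into dst, encoding its path in the name."""
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() materializes the listing before any file is copied.
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        # a/b/c.html becomes a_b_c.html, so the origin stays visible in the name.
        flat_name = "_".join(path.relative_to(src).parts)
        target = dst / flat_name
        shutil.copy2(path, target)
        copied.append(target)
    return copied
```

The destination should be a directory outside the source tree, for example flatten_tree("extracted_data", "flatten_data"). Note that joining with underscores can produce name collisions when folder or file names already contain underscores.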

Extracting HTML Content

To extract HTML content from the extracted data, you can use the convert_html.sh script. The script requires the path to the HTML file as an argument.

Running the HTML Extraction Script

bash convert_html.sh $html_path

Replace $html_path with the path to your HTML file. For example:

bash convert_html.sh extracted_data/example.html
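
Since pandoc is listed as a dependency, the conversion presumably goes through it. The sketch below shows how an HTML file could be turned into plain text by calling pandoc from Python. The choice of plain-text output, the .txt suffix, and the function names are assumptions for illustration; the exact options used by convert_html.sh are not shown in this README.

```python
import subprocess
from pathlib import Path


def pandoc_command(html_path, out_format="plain"):
    """Build a pandoc command that converts html_path to a sibling .txt file."""
    html_path = Path(html_path)
    out_path = html_path.with_suffix(".txt")
    # -f sets the input format, -t the output format, -o the output file.
    return ["pandoc", "-f", "html", "-t", out_format, str(html_path), "-o", str(out_path)]


def convert_html(html_path):
    """Run pandoc on html_path and return the path of the converted file."""
    cmd = pandoc_command(html_path)
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

Calling convert_html("extracted_data/example.html") would write extracted_data/example.txt, provided pandoc is installed and the input file exists.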

Repository: akira-l/warc_extractor_mnbvc