ptypes-nlesc/metadata-scraper: Tool to extract structured metadata (e.g., upload date, tags, views) from video URLs for analysis of stereotypes in online pornography.

This asynchronous web scraper is designed to extract metadata from public Pornhub video pages. It is developed for academic NLP and social computing research.

Steps:

1. Load a CSV file (data.csv) with video URLs (delimited by ‽).
2. For each video URL, retrieve:
   - Title
   - Upload date
   - Vote count (upvotes)
   - View count
   - Categories
   - Tags
3. Use aiohttp for fast concurrent scraping (primary method).
4. Retry failed aiohttp requests with exponential backoff.
5. Fall back to Playwright if aiohttp fails after all retries.
6. Log hard failures if even Playwright can't scrape the page.
7. Append results in batches to the output .csv file, avoiding full reloads.
8. Retry failed rows from the CSV using a two-pass strategy.
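
The retry and fallback steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in async_scraper.py: the names `read_urls`, `fetch_with_fallback`, `primary`, `fallback`, and the `url` column header are all hypothetical, and the aiohttp and Playwright fetchers are stood in for by plain async callables so the control flow can be read (and run) without network access.

```python
import asyncio
import csv
import io
import logging
from typing import Awaitable, Callable, Optional

logger = logging.getLogger("scraper_sketch")

DELIMITER = "\u203d"  # the interrobang character named as the delimiter in the steps

# A fetcher takes a URL and returns a metadata dict, or raises on failure.
# In the real tool these would wrap aiohttp and Playwright (assumption).
Fetcher = Callable[[str], Awaitable[dict]]


def read_urls(csv_text: str, column: str = "url") -> list[str]:
    """Return the values of `column` (a hypothetical header name) from delimited text."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=DELIMITER)
    return [row[column] for row in reader]


async def fetch_with_fallback(
    url: str,
    primary: Fetcher,
    fallback: Fetcher,
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Optional[dict]:
    """Try `primary` up to `retries` times, then `fallback` once; None on hard failure."""
    for attempt in range(retries):
        try:
            return await primary(url)
        except Exception as exc:
            logger.warning("primary attempt %d failed for %s: %s", attempt + 1, url, exc)
            if attempt < retries - 1:
                # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
                await sleep(base_delay * 2 ** attempt)
    try:
        return await fallback(url)
    except Exception as exc:
        # Both methods failed: record a hard failure and let the caller move on.
        logger.error("hard failure for %s: %s", url, exc)
        return None
```

A caller might run many of these concurrently with `asyncio.gather`, collect the non-`None` results into batches for appending, and keep the URLs that returned `None` for a second pass. The `sleep` parameter is injectable only so the backoff schedule can be checked without waiting.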

Repository files:

.gitignore
LICENSE
README.md
async_scraper.py
posthoc.py
requirements.txt
