Tool to extract structured metadata (e.g., upload date, tags, views) from video URLs for analysis of stereotypes in online pornography.

ptypes-nlesc/metadata-scraper

This asynchronous web scraper extracts metadata from public Pornhub video pages. It was developed for academic NLP and social computing research.

Steps:

  1. Load a CSV file (data.csv) with video URLs (delimited by ‽).
  2. For each video URL, retrieve:
     • Title
     • Upload date
     • Vote count (upvotes)
     • View count
     • Categories
     • Tags
  3. Use aiohttp for fast concurrent scraping (primary method).
  4. Retry failed aiohttp requests with exponential backoff.
  5. Fall back to Playwright if aiohttp fails after all retries.
  6. Log hard failures if even Playwright can't scrape.
  7. Append results in batches to .csv, avoiding full reloads.
  8. Retry failed rows from the CSV using a two-pass strategy.
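
The control flow in the steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names (read_urls, fetch_with_retry, append_batch), the retry count, the base delay, the URL column index, and the comma-delimited output format are all assumptions. The primary and fallback fetchers are passed in as plain async callables, so the sketch does not depend on aiohttp or Playwright directly.

```python
import asyncio
import csv
from pathlib import Path
from typing import Awaitable, Callable, Optional

DELIMITER = "‽"  # input field delimiter named in step 1

# Hypothetical type: any async function mapping a URL to a metadata dict.
Fetcher = Callable[[str], Awaitable[dict]]


def read_urls(path: str, column: int = 0) -> list[str]:
    """Read URLs from a ‽-delimited file (column index is an assumption)."""
    with open(path, newline="", encoding="utf-8") as fh:
        return [row[column] for row in csv.reader(fh, delimiter=DELIMITER) if row]


async def fetch_with_retry(
    url: str,
    primary: Fetcher,
    fallback: Fetcher,
    retries: int = 3,          # assumed value
    base_delay: float = 1.0,   # assumed value, in seconds
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Optional[dict]:
    """Try the primary fetcher with exponential backoff, then the fallback."""
    for attempt in range(retries):
        try:
            return await primary(url)
        except Exception:
            if attempt < retries - 1:
                # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
                await sleep(base_delay * 2 ** attempt)
    try:
        return await fallback(url)
    except Exception:
        return None  # hard failure: the caller is expected to log it


def append_batch(path: str, rows: list[dict], fieldnames: list[str]) -> None:
    """Append a batch of rows, writing the header only when creating the file."""
    is_new = not Path(path).exists()
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```

In a real run, primary would presumably wrap an aiohttp request and fallback a Playwright page load. Opening the output file in append mode is one way to avoid rereading earlier results on each batch.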
