This asynchronous web scraper is designed to extract metadata from public Pornhub video pages. It is developed for academic NLP and social computing research.
Steps:
- Load a CSV file (data.csv) with video URLs (delimited by ‽).
- For each video URL retrieve:
  - Title
  - Upload date
  - Vote count (upvotes)
  - View count
  - Categories
  - Tags
- Use aiohttp for fast concurrent scraping (primary method).
- Retry failed aiohttp requests with exponential backoff.
- Fall back to Playwright if aiohttp fails after all retries.
- Log a hard failure when Playwright also fails to scrape the page.
- Append results in batches to the output CSV file, so the file never has to be reloaded in full.
- Retry failed rows from the CSV using a two-pass strategy.
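
The retry, fallback, and batch-append steps above can be sketched roughly as follows. This is a minimal illustration and not the scraper's actual code: `fetch_with_backoff`, `scrape_one`, and `append_batch` are hypothetical names, the `primary` and `fallback` arguments stand in for the aiohttp-based and Playwright-based fetchers, and the retry count, base delay, and use of ‽ as the output delimiter are assumptions.

```python
import asyncio
import csv
import logging
import random
from pathlib import Path

DELIMITER = "‽"  # assumed: same delimiter as the input file
log = logging.getLogger("scraper")


async def fetch_with_backoff(fetch, url, retries=3, base_delay=0.5):
    """Await the hypothetical coroutine fetch(url), retrying on any exception.

    The wait grows as base_delay, 2*base_delay, 4*base_delay, ... with a
    small random jitter so concurrent tasks do not retry in lockstep.
    """
    for attempt in range(retries):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # all retries used up; let the caller fall back
            jitter = random.uniform(0, base_delay / 10)
            await asyncio.sleep(base_delay * 2 ** attempt + jitter)


async def scrape_one(url, primary, fallback, retries=3, base_delay=0.5):
    """Try the primary fetcher with retries, then the fallback once.

    Returns the metadata dict, or None after logging a hard failure.
    """
    try:
        return await fetch_with_backoff(primary, url, retries, base_delay)
    except Exception:
        try:
            return await fallback(url)
        except Exception as exc:
            log.error("hard failure for %s: %r", url, exc)
            return None


def append_batch(path, rows, fieldnames):
    """Append a batch of row dicts to a CSV file without reading it back."""
    path = Path(path)
    # Write the header only when the file is new or empty.
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter=DELIMITER)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Passing the fetchers in as arguments keeps the retry and fallback logic independent of aiohttp and Playwright, so it can be exercised with stub coroutines. A second pass over failed rows could reuse `scrape_one` on the URLs that returned `None`.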