Web-as-corpus tools in Java.
* Simple crawler (also integrates with Nutch and Heritrix)
* HTML cleaner to remove boilerplate
* Language recognition
* Corpus builder
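
To make the HTML-cleaning step above concrete, here is a minimal sketch of boilerplate removal using only the Java standard library. It does not use JavaWAC's own API, which is not shown on this page; the class name `HtmlCleanerSketch` and the method `clean` are hypothetical, and the regex approach is a naive stand-in for whatever technique the project actually uses. It does not decode HTML entities or handle malformed markup.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; NOT JavaWAC's API.
class HtmlCleanerSketch {
    // Elements whose content is usually page furniture rather than corpus text.
    private static final Pattern BLOCKS = Pattern.compile(
        "(?is)<(script|style|nav|header|footer)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern COMMENTS = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");
    private static final Pattern SPACE = Pattern.compile("\\s+");

    // Returns the visible text of the page with boilerplate blocks dropped.
    static String clean(String html) {
        String s = COMMENTS.matcher(html).replaceAll(" ");
        s = BLOCKS.matcher(s).replaceAll(" ");   // drop blocks with their content
        s = TAGS.matcher(s).replaceAll(" ");     // drop remaining tags, keep text
        return SPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String page = "<html><head><style>p{color:red}</style></head>"
            + "<body><nav><a href=\"/\">Home</a></nav>"
            + "<p>Corpus text.</p><footer>Contact</footer></body></html>";
        System.out.println(clean(page)); // prints "Corpus text."
    }
}
```

A production cleaner would more likely parse the HTML into a tree and score blocks by text density; the regex version is only meant to show where this step sits between crawling and corpus building.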

License

Apache License V2.0

Additional Project Details

Intended Audience

Science/Research

User Interface

Non-interactive (Daemon), Web-based

Programming Language

Java

Related Categories

Java Search Engines, Java Frameworks, Java Intelligent Agents, Java Information Analysis Software

Registered

2008-04-11