# Companion to Text Mining Workshop

Charles H. Pence, Louisiana State University, Philosophy
Last Update: March 8, 2016
## Are you at the workshop right now?
Then you need to download Voyant Server so you can follow along with our demo here in a little while! Head to the Voyant Server site and start downloading the ZIP file. Also, download the Jane Austen corpus we'll be feeding into Voyant.
## Outline
- Why text mining?
- The big buckets
- Putting your own data together
- Perils of data access and quality
- Local text-mining tools
- Data and reproducibility
## Why text mining?
- Burrows, Computation into Criticism (1987)
- Web-based Twitter sentiment analysis app
- Text-mining paper abstracts to detect drug interactions

## Big buckets of data
- Google Ngram Viewer
  - 8M books (= 6% of all books ever published)
  - Basic information about the viewer, its search methods, and its corpus
  - Raw access to the Ngram datasets (warning: massive, massive download ahead; see the parsing sketch after this list)
  - Most recent academic publication on the Ngram corpus, describing its part-of-speech tagging
  - First major publication, describing the corpus in general
- HathiTrust Digital Library: Bookworm
  - 4.6M books searchable in this database (of HT's 13.9M total)
- JSTOR Data for Research
  - 9.3M journal articles
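
If you do grab the raw Ngram datasets, each file is plain tab-separated text, so tallying one word's yearly counts takes only a few lines. A minimal sketch, assuming the version-2 format (ngram, year, match_count, volume_count) described on the dataset page; the filename is hypothetical, and the real shards are gzipped and very large:

```python
import csv

def yearly_counts(path, target):
    """Collect per-year match counts for a single ngram from one raw shard."""
    counts = {}
    with open(path, encoding="utf-8") as f:
        # Each row: ngram \t year \t match_count \t volume_count
        for ngram, year, match_count, _volumes in csv.reader(f, delimiter="\t"):
            if ngram == target:
                counts[int(year)] = int(match_count)
    return counts

# Hypothetical (decompressed) shard of the English 1-gram dataset.
print(yearly_counts("eng-all-1gram-shard.tsv", "congress"))
```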
## Assembling your own data
- Project Gutenberg
  - A repository of public-domain books, available in plain text. Most have been at least proofread by a team of volunteers, so the text is very often in good condition. (See the first sketch after this list for stripping Gutenberg's licensing boilerplate.)
- Google Books
  - For public-domain books not appearing in Project Gutenberg, a scanned copy is very often available as a PDF from Google Books. Text can then be extracted from the PDF using a variety of tools (see the second sketch after this list). Alternatively, you can re-run the optical character recognition (OCR) on the scans using a tool such as ABBYY FineReader.
- EEBO-TCP / ECCO-TCP / Evans-TCP
  - EEBO: Early English Books Online, books written in English, 1475–1700, marked up in XML/TEI (just under 30,000 volumes)
  - ECCO: Eighteenth-Century Collections Online, books written in English in the 18th century, marked up in XML/TEI (just over 2,000 volumes)
  - Evans Early American Imprint Collection, books, pamphlets, and broadsides from America, 1640–1821, marked up in XML/TEI (just under 5,000 volumes)
- JSTOR DFR
  - With an account, you can ask the JSTOR DFR administrators for permission to download (limited amounts of) full-text articles through the DFR.
- Open access journals
  - The PMC Open Access Subset
  - PLoS journals are available in XML
  - BMC journals may be downloaded in bulk
- Social network data
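
Because Project Gutenberg wraps each book in licensing boilerplate, and that boilerplate will skew word frequencies (see the License note below), it is worth stripping before analysis. A minimal sketch, assuming the standard `*** START OF ... ***` and `*** END OF ... ***` markers; the filename is hypothetical:

```python
import re

# Gutenberg plain-text files wrap the book between START/END license markers.
START = re.compile(r"\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*")
END = re.compile(r"\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG EBOOK")

def strip_gutenberg_boilerplate(text):
    """Return only the body of a Project Gutenberg plain-text file."""
    start, end = START.search(text), END.search(text)
    if start and end:
        return text[start.end():end.start()].strip()
    return text  # markers not found; leave the text untouched

with open("pride-and-prejudice.txt", encoding="utf-8") as f:  # hypothetical file
    body = strip_gutenberg_boilerplate(f.read())
```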
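And for the Google Books route, one way to pull the text layer out of a downloaded PDF is the pypdf library; a minimal sketch with a hypothetical filename (scans with no embedded text layer will come back empty and need re-OCR instead):

```python
from pypdf import PdfReader  # pip install pypdf

# Extract whatever text layer the PDF carries, page by page.
reader = PdfReader("scanned-book.pdf")  # hypothetical file
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])
```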
## Problems of access and quality

### Getting your hands on the sources
- For-profit journals
  - Elsevier has a text-mining API, but you have to contort your project to work with it
  - Effectively all other publishers require negotiating a contract
- Contemporary books
  - HathiTrust has them, but access is limited and controlled (LSU is not a "friend" library)
- Google Ngram worries
  - Questions of statistical significance (letter 1, letter 2, response)
  - Questions of the breadth of the corpus (blog post arguing that earlier works skew scientific/biblical)
  - Questions of reconstructing historical concepts
- Problems with the representativeness of social media data
### OCR error and other textual problems
- The 11th of the month
  - Original XKCD comic
  - Graph of the data as pulled from Google Ngrams (screenshot in the repository)
  - Graph of the data after the OCR error is accounted for (screenshot in the repository)
- Variant typography and spelling
  - Frequencies of 'congrefs' and 'Congrefs' in the corpus, 1700–1850 (screenshot in the repository)
  - More complex: frequencies of 'blefsed', 'bleffcd', 'blefled', 'bleffed', 'bleflcd', and 'blcfled' in the corpus, 1650–1850 (screenshot in the repository)
  - An attempt at automated OCR correction rules for 17th- and 18th-century texts (see the sketch after this list)
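
Rules like these mostly target the long s ('ſ'), which OCR tends to read as 'f'. A minimal sketch of the general idea (not the linked rule set), using a tiny stand-in dictionary:

```python
import re

# Stand-in for a real modern-English word list; a real rule set would be larger.
WORDS = {"congress", "blessed"}

def correct_long_s(token):
    """Try f->s substitutions to undo OCR misreading the long s ('ſ') as 'f'."""
    if token.lower() in WORDS:
        return token
    positions = [m.start() for m in re.finditer("f", token.lower())]
    # Try every subset of the 'f' positions; accept the first dictionary hit.
    for mask in range(1, 2 ** len(positions)):
        chars = list(token)
        for i, pos in enumerate(positions):
            if mask & (1 << i):
                chars[pos] = "s" if token[pos].islower() else "S"
        candidate = "".join(chars)
        if candidate.lower() in WORDS:
            return candidate
    return token

print(correct_long_s("Congrefs"))  # -> Congress
print(correct_long_s("blefsed"))   # -> blessed
```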
### Algorithmic extraction of metadata/tagging
- Automated extraction can only be so good
- Stanford NLP named entity extraction on Austen's History of England (see the sketch after this list for running these tools yourself)
  - "Lord Mounteagle" = LOCATION
  - "Essex", "Warwick" = ORGANIZATION (confused with the universities?)
  - "Finis" = PERSON
- Stanford NLP part-of-speech tagging on Austen's History of England
  - "AGAINST" = proper noun (confused by the all-caps?)
  - In "who sailed round the World", "round" = noun (confused by the archaic usage)
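
To reproduce this kind of check yourself, one option is the Stanford NLP Group's stanza package, a Python interface that postdates the CoreNLP tools shown in the slides; a minimal sketch, with an input sentence echoing the examples above:

```python
import stanza  # pip install stanza; Stanford NLP Group's Python package

# Download the English models once, then build a tagging/NER pipeline.
stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,pos,ner")

doc = nlp("Lord Mounteagle received a letter. Drake sailed round the World.")

# Named entities, to spot misclassifications like "Lord Mounteagle" = LOCATION.
for ent in doc.ents:
    print(ent.text, ent.type)

# Part-of-speech tags, to spot errors like "round" tagged as a noun.
for sent in doc.sentences:
    for word in sent.words:
        print(word.text, word.upos)
```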
## Text mining tools you can run locally
- Voyant Tools
  - Local download version
  - Introductory workshop
  - List of all available tools
  - Links to some tools that can run on your local version of Voyant:
- The Intelligent Archive
- David Hoover's Excel Text Analysis Pages
  - The Zeta and Iota Spreadsheets (usage guide)
  - The Delta Spreadsheets (usage guide) [not used in the workshop, but potentially also useful]
## Data and reproducibility
- When using tools like Voyant, make sure to export your data
  - CSV (comma-separated values) format can be read by Excel
  - TXT/RTF (plain text or rich text) formats can be read by your text editor or Word
- Also document how you perform your analyses (a sketch of this workflow follows this list)
  - Include which version of Voyant you used
  - Lay out in detail what data sources you used, and where and when you got your data
  - Describe which tools you used and what settings you chose
  - Think about how you'd describe what you're doing to someone who had never used the tool before
- Consider switching to command-line (or scripting-based) tools when you feel capable, as they provide much higher reproducibility
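
For instance, once you have a CSV export, the downstream analysis can live in a short script rather than in an undocumented sequence of GUI clicks. A minimal sketch, assuming a hypothetical export file voyant-terms.csv with Term and Count columns:

```python
import csv

# Read a hypothetical Voyant term-frequency export (columns: Term, Count).
with open("voyant-terms.csv", newline="", encoding="utf-8") as f:
    rows = [(row["Term"], int(row["Count"])) for row in csv.DictReader(f)]

# Sorting and printing the top terms here documents itself; rerunning the
# script reproduces the result exactly.
for term, count in sorted(rows, key=lambda r: r[1], reverse=True)[:10]:
    print(f"{count:6d}  {term}")
```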
## Want to add to this workshop?
If you have a resource that you think would be useful to participants in a workshop like this, feel free to submit an issue on GitHub, or fork the repository and send me a pull request!
## License
All content here produced by Charles Pence (and not licensed in other ways as noted) is available under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). For copyrights and credits for some of these images, check out the subdirectories in the repository. The Austen corpus is released under the terms of the Project Gutenberg Terms of Use, as the corpus has been lightly edited from the original Project Gutenberg source material (removing licensing language, which would otherwise skew results).


