Resources for a workshop I offered on text mining in the digital humanities
This repository has been archived on 2025-08-04. You can view files and clone it, but you cannot make any changes to its state, such as pushing and creating new issues, pull requests or comments.

Companion to Text Mining Workshop

Charles H. Pence, Louisiana State University, Philosophy
Last Update: March 8, 2016


Are you at the workshop right now?

Then you need to download Voyant Server so you can follow along with the demo in a little while! Head to the Voyant Server site and start downloading the ZIP file. Also download the Jane Austen corpus (AustenCorpus.zip in this repository), which we'll be feeding into Voyant.


Outline


Why text mining?

Big buckets of data

Assembling your own data

  • Project Gutenberg
    • A repository of public-domain books, available in plain text. Most have been at least proofread by a team of volunteers, so the text is very often in good condition.
  • Google Books
    • For public-domain books that are not in Project Gutenberg, a scanned copy is very often available as a PDF from Google Books. Text can then be extracted from the PDF using a variety of tools. (Alternatively, you can re-run optical character recognition (OCR) on the scanned pages using a tool such as ABBYY FineReader.)
  • EEBO-TCP / ECCO-TCP / Evans-TCP
    • EEBO: Early English Books Online, books written in English from 1475-1700, marked up in XML/TEI (just under 30,000 volumes)
    • ECCO: Eighteenth-Century Collections Online, books written in English in the 18th century, marked up in XML/TEI (just over 2,000 volumes)
    • Evans Early American Imprint Collection, books, pamphlets, and broadsides from America, 1640-1821, marked up in XML/TEI (just under 5,000 volumes)
  • JSTOR DFR
    • With an account, you can ask the administrators of JSTOR Data for Research (DFR) for permission to download (limited amounts of) full-text articles via the DFR.
  • Open access journals
  • Social network data
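
Since the TCP collections above are distributed as XML/TEI, you will usually want to pull plain text out of the markup before feeding it to a tool. Here is a minimal sketch of one way to do that with Python's standard library. The sample document is a made-up stand-in, and the namespace is an assumption: it matches TEI P5 files, but some files may use a different namespace or none at all, in which case the lookup would need adjusting.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified TEI document standing in for a real file.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>Sample title</title></titleStmt></fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First paragraph of the <hi>sample</hi> text.</p>
      <p>Second paragraph.</p>
    </body>
  </text>
</TEI>"""

# Assumption: the file uses the TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_body_text(xml_string):
    """Return the text inside <body>, with markup removed and whitespace collapsed."""
    root = ET.fromstring(xml_string)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return ""
    # itertext() walks every text node under <body>, skipping the tags themselves.
    return " ".join("".join(body.itertext()).split())


print(tei_body_text(SAMPLE_TEI))
```

This keeps only the body and drops the header metadata, which is usually what you want for word counts. Real files contain notes, page breaks, and gap markers that may deserve special handling.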

Problems of access and quality

Getting your hands on the sources

OCR error and other textual problems

Algorithmic extraction of metadata/tagging

Text mining tools you can run locally

Data and reproducibility

  • When using tools like Voyant, make sure to export your data
    • CSV (comma-separated values) format can be read by Excel
    • TXT/RTF (text or rich text) formats can be read by your text editor or Word
  • Also document how you perform your analyses
    • Include what version of Voyant you use
    • Lay out in detail what data sources you used, where and when you got your data, etc.
    • Describe which tools you used and which settings you chose
    • Think about how you'd describe what you're doing to someone who hadn't ever used the tool before
  • Consider making the switch to command-line-based (or scripting-based) tools when you feel capable, as these provide much higher reproducibility
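
To give a feel for what a scripting-based workflow might look like, here is a small sketch in Python that counts words, writes the counts as CSV, and keeps a record of how the analysis was done. It is only an illustration, not the method used in the workshop: the function names, the tokenization rule, and the fields of the provenance record are all assumptions chosen for the example.

```python
import csv
import io
import json
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word tokens (runs of letters and apostrophes)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def frequencies_to_csv(counts):
    """Serialize counts as CSV text, most frequent first, ties in alphabetical order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "count"])
    for word, count in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        writer.writerow([word, count])
    return buffer.getvalue()


# Hypothetical provenance record; every value is a placeholder to fill in.
provenance = {
    "source": "where the text came from",
    "retrieved": "date you downloaded it",
    "tool": "name and version of the tool or script",
    "settings": {"lowercase": True, "token_pattern": "[a-z']+"},
}

sample = "It is a truth universally acknowledged, that a single man..."
csv_text = frequencies_to_csv(word_frequencies(sample))
print(csv_text)
print(json.dumps(provenance, indent=2))
```

Because the script itself records the settings, rerunning it on the same input gives the same CSV, and the provenance record can be saved alongside the results.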

Want to add to this workshop?

If you have a resource that you think would be useful to participants in a workshop like this, feel free to submit an issue on GitHub, or fork the repository and send me a pull request!


License

All content here produced by Charles Pence (and not licensed in other ways as noted) is available under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). For copyrights and credits for some of these images, check out the subdirectories in the repository. The Austen corpus is released under the Project Gutenberg Terms of Use, because it is lightly edited from the original Project Gutenberg source material (the licensing language was removed, since it would skew results).
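
If you want to prepare your own Project Gutenberg texts in a similar way, the licensing language can be removed with a short script. The sketch below is not how the Austen corpus was actually prepared; it assumes the text is wrapped in marker lines beginning with `*** START OF` and `*** END OF`, which many Project Gutenberg plain-text files use, though the exact wording varies from file to file. Check the Project Gutenberg terms before redistributing an edited text.

```python
import re

# Assumed marker format, e.g. "*** START OF THE PROJECT GUTENBERG EBOOK EMMA ***".
# The optional space covers files that write "***START OF ...".
START_RE = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, if both exist."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start is None or end is None or end.start() < start.end():
        return text  # markers missing or out of order: leave the text untouched
    return text[start.end():end.start()].strip()


# Made-up miniature file for demonstration.
sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EMMA ***\n"
    "Emma Woodhouse, handsome, clever, and rich\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EMMA ***\n"
    "License footer\n"
)
print(strip_gutenberg_boilerplate(sample))
```

Returning the input unchanged when the markers are missing is a deliberate choice: it is safer to notice an unstripped file later than to silently lose part of a book.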