Navigating the Storm:
eMOP, Big DH Projects, and Agile
Steering Standards
Elizabeth Grumbach, Co-Project Manager, IDHMC
Laura Mandell, PI, IDHMC
Apostolos Antonacopoulos, PRImA Lab
Clemens Neudecker, Koninklijke Bibliotheek
Matthew Christy, Co-Project Manager, IDHMC
Loretta Auvil, SEASR Analytics
Todd Samuelson, Cushing Memorial Library
emop.tamu.edu
Initial Goals
Challenges
Or
Failures
Analysis
New Directions
Adaptability
Navigating the Storm | @EMGrumbach | emop.tamu.edu
Straight from the grant proposal…
“Our overarching goals”
1) Train three open-access OCR engines to “read” early modern
fonts
2) Map specific font training onto specific sets of documents
3) Create error-evaluation mechanisms for failed documents
4) Use crowd-sourced correction tools specific to OCR errors
5) Identify pages that are too flawed to be “readable”
6) Share our workflow procedure and results, so that the
community can use them in digitizing and transcribing early
modern documents.
Main Collaborators
CIIR (UMass Amherst)
IDHMC + Cushing Memorial Library (Texas A&M)
Koninklijke Bibliotheek (Netherlands)
Performant Software Solutions (Charlottesville, Virginia)
PRImA Labs (University of Salford, Manchester)
PSI Labs (Texas A&M)
SEASR (U of Illinois, Urbana-Champaign)
Data Contributors + Collaborators
Early English Books Online (EEBO)
Eighteenth Century Collections Online (ECCO)
Text Creation Partnership (TCP)
Brazos Computing Cluster (Texas A&M)
Laura Mandell, Principal Investigator, eMOP
Director, IDHMC
@mandellc
idhmc@tamu.edu
Early Modern Printing
• Individual, hand-made typefaces
• Worn and broken type
• Poor quality equipment/paper
• Inconsistent line bases
• Unusual page layouts, decorative page elements
• Special characters & ligatures
• Spelling variations
• Mixed typefaces and languages
Slides by Matthew Christy
• Irregular Layouts
• Print Bleedthrough
Document/Image Quality
• Torn and damaged pages
• Noise introduced to images of pages
• Skewed pages
• Warped pages
• Missing pages
• Inverted pages
• Incorrect metadata
• Extremely low quality TIFFs (~50K)
Dream
Training Tesseract on different fonts and applying that training to the documents printed in those particular fonts will improve OCR quality.
Reality
There may be as much difference between one letter and another in a specific font as there is between letters in different fonts.
Training Tesseract
Aletheia
Created by PRImA Research Labs. A team of undergraduates uses Aletheia to identify each glyph on
the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an
XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode
values.
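The per-glyph record described above can be pictured with a short Python sketch. The XML layout used here (a `Glyph` element with `x`, `y`, `w`, `h` attributes and a `Unicode` child) is a simplified, hypothetical stand-in and not Aletheia's actual output schema; `Glyph` and `parse_glyphs` are illustrative names.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical, simplified glyph XML. Aletheia's real output is richer;
# this only mirrors the idea of "glyph + coordinates + Unicode value".
SAMPLE_XML = """
<Page>
  <Glyph x="10" y="20" w="8" h="12"><Unicode>a</Unicode></Glyph>
  <Glyph x="20" y="20" w="9" h="12"><Unicode>ſ</Unicode></Glyph>
  <Glyph x="31" y="21" w="8" h="11"><Unicode>a</Unicode></Glyph>
</Page>
"""


@dataclass(frozen=True)
class Glyph:
    char: str  # Unicode value assigned by the annotator
    x: int     # left edge, pixels from the left of the image
    y: int     # top edge, pixels from the top of the image
    w: int     # width in pixels
    h: int     # height in pixels


def parse_glyphs(xml_text: str) -> list:
    """Read every <Glyph> element into a Glyph record."""
    root = ET.fromstring(xml_text.strip())
    glyphs = []
    for el in root.iter("Glyph"):
        glyphs.append(
            Glyph(
                char=el.findtext("Unicode", default=""),
                x=int(el.get("x")),
                y=int(el.get("y")),
                w=int(el.get("w")),
                h=int(el.get("h")),
            )
        )
    return glyphs


if __name__ == "__main__":
    for glyph in parse_glyphs(SAMPLE_XML):
        print(glyph)
```

Keeping the Unicode value next to the coordinates is what lets a later tool check whether a glyph was assigned the wrong character.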
Training Tesseract
Franken+
1. Takes Aletheia's output files as input.
2. Groups all glyphs with the same Unicode values
into one window for comparison.
3. Mistakenly coded glyphs are easily identified and
re-coded.
4. A user can quickly compare all exemplars of a
glyph and choose just the best subset, if desired.
5. Uses all selected glyphs to create a Franken-page
image (TIFF) using a selected text as a base.
6. Outputs the same box files and TIFF images that
Tesseract's first stage of native training produces.
7. Also allows users to complete Tesseract training
using newly created box/TIFF file pairs, and add
optional dictionary and other files.
8. Outputs a .traineddata file used by Tesseract
when OCRing page images.
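Steps 2 and 6 above can be sketched in a few lines of Python. This is not Franken+'s code: the tuple layout and the names `group_by_unicode` and `to_box_line` are assumptions made for illustration. The one fixed point is the Tesseract box-file line, which lists the symbol, then left, bottom, right and top pixel coordinates measured from the bottom-left corner of the image, then the page number.

```python
from collections import defaultdict


def group_by_unicode(glyphs):
    """Group glyph tuples (char, left, top, width, height) by their character.

    Seeing all exemplars of one character side by side is what makes
    mis-coded glyphs easy to spot.
    """
    groups = defaultdict(list)
    for glyph in glyphs:
        groups[glyph[0]].append(glyph)
    return dict(groups)


def to_box_line(char, left, top, width, height, image_height, page=0):
    """Format one glyph as a Tesseract box-file line.

    Input coordinates are assumed to use a top-left origin (as in most
    image tools), so the y axis is flipped for the box file.
    """
    bottom = image_height - (top + height)
    box_top = image_height - top
    right = left + width
    return f"{char} {left} {bottom} {right} {box_top} {page}"


if __name__ == "__main__":
    sample = [("a", 10, 20, 8, 12), ("ſ", 20, 20, 9, 12), ("a", 31, 21, 8, 11)]
    for char, exemplars in group_by_unicode(sample).items():
        print(char, len(exemplars))
    for glyph in sample:
        print(to_box_line(*glyph, image_height=100))
```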
Clemens Neudecker, Koninklijke Bibliotheek
@cneudecker
The case of IMPACT
• IMPACT = IMProving ACcess to Text
• EU FP7, 2008 – 2012
• €16.7 M budget
• 22 partners (libraries, universities, companies)
• Goal: Significantly improve OCR for historical
documents
Issue 1
• Expectation: The "IMPACT OCR"
• Reality: A collection of very diverse tools,
algorithms, etc. Some prototypes, some
commercial tools, different programming
languages, different levels of maturity etc.
• No integrated product possible!
Issue 1
• Solution: Interoperability rather than integration
• Change: Individual applications as pluggable
modules in a web-based framework
• Result: Flexible framework with additional
benefits for testing, transparency, provenance
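The idea of interoperable, pluggable modules can be illustrated with a minimal Python sketch. It is not the IMPACT framework or its API: plain functions stand in for the individual applications, and `REGISTRY`, `register` and `run_workflow` are hypothetical names. The point is only that each tool stays independent behind a shared interface, and that the workflow can record which modules ran.

```python
from typing import Callable, Dict, List, Tuple

# Every module shares one interface: text in, text out.
Module = Callable[[str], str]

REGISTRY: Dict[str, Module] = {}


def register(name: str) -> Callable[[Module], Module]:
    """Decorator that plugs a module into the registry under a name."""
    def decorator(fn: Module) -> Module:
        REGISTRY[name] = fn
        return fn
    return decorator


@register("long-s")
def modernise_long_s(text: str) -> str:
    # Toy post-correction step: replace the long s with a modern s.
    return text.replace("ſ", "s")


@register("normalise-whitespace")
def normalise_whitespace(text: str) -> str:
    # Toy clean-up step: collapse runs of whitespace.
    return " ".join(text.split())


def run_workflow(text: str, steps: List[str]) -> Tuple[str, List[str]]:
    """Apply the named modules in order and record which ones ran."""
    provenance: List[str] = []
    for name in steps:
        text = REGISTRY[name](text)
        provenance.append(name)
    return text, provenance


if __name__ == "__main__":
    print(run_workflow("bleſſed   are", ["long-s", "normalise-whitespace"]))
```

Because modules are looked up by name, one can be swapped or tested in isolation without touching the others, which is where the testing and provenance benefits come from.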
Issue 2
• Diversity: Librarians, Computer Scientists,
Computational Linguists, Humanists
• Are we really talking the same language?
• Different focus points in the project: applicable
solutions vs. academic publications
Issue 2
• Solution: Create bonding activities, foster
atmosphere for knowledge exchange
• Change: Buddy programme, social games,
quizzes about partners
• Result: Understand your partners' backgrounds and
their way of thinking; enrich the experience for everyone
Large Digitisation Projects:
Two Key Perspectives
Apostolos Antonacopoulos
PRImA Research Lab
Background
Since 2002 the PRImA Lab has been involved in large digitisation
projects, creating software tools for all stages of the workflow
• From Image Enhancement to Layout Analysis to OCR
• Use-scenario based evaluation of extracted text quality
• Crowd/Scholar-sourcing
Two general points are routinely underestimated:
• (Really) Understanding stakeholders and their roles
• (Real) Understanding of problems, their extent and the
effectiveness/requirements of potential solutions
Stakeholders and their
roles
Seems obvious and often mentioned but the significance of
understanding this point and its effects is vastly underestimated
Content holders
• Keen for their content to be widely available and used
• Do not know their content well, nor its potential uses
Computer scientists
• Have technical expertise to solve many of the problems
• Do not know the material and its uses well enough to prioritise problems
DH researchers – the catalysts
• Very knowledgeable of material and potential use
• Have technical skills complementary to those of computer scientists
Problem understanding
At the start of each project everyone is eager to deliver “big” results but
it is important to identify and understand a few key problems and solve
them well
“Improve OCR results” is an ill-defined and short-sighted goal
• Measured in terms of word-accuracy, OCR results are of little use
• Layout is very important
• Even if all the words are recognised correctly, the reading order is unlikely to be
correct, limiting potentially interesting uses.
• Page numbers, captions, running headers etc. should not be mixed with body text
• Graphical elements / illustrations are important too
Think: Useful data (investment) vs. just more of any data (instant
gratification)
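The gap between word accuracy and reading order can be made concrete with a small Python sketch. Both measures below are simplified illustrations chosen for this example, not the evaluation methods used by PRImA or eMOP, and the function names are hypothetical. The sample imitates a two-column page whose columns were read in the wrong order.

```python
from collections import Counter
from difflib import SequenceMatcher


def bag_of_words_accuracy(truth: str, ocr: str) -> float:
    """Share of ground-truth words found in the OCR text, ignoring order."""
    truth_counts = Counter(truth.split())
    ocr_counts = Counter(ocr.split())
    matched = sum((truth_counts & ocr_counts).values())
    return matched / max(1, sum(truth_counts.values()))


def in_order_accuracy(truth: str, ocr: str) -> float:
    """Share of ground-truth words that also appear in matching sequence."""
    truth_words, ocr_words = truth.split(), ocr.split()
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(1, len(truth_words))


if __name__ == "__main__":
    truth = "the quick brown fox jumps over the lazy dog"
    # Every word is recognised, but the two "columns" are swapped.
    ocr = "jumps over the lazy dog the quick brown fox"
    print("bag of words:", bag_of_words_accuracy(truth, ocr))
    print("in order:    ", in_order_accuracy(truth, ocr))
```

The order-blind score reports a perfect result for the swapped text, while the order-sensitive score does not, which is the sense in which word accuracy alone says little about how usable the text is.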
Elizabeth Grumbach, Co-Project Manager, IDHMC
@EMGrumbach
egrumbac@tamu.edu
“If an electronic scholarly project can’t fail and
doesn’t produce new ignorance, then it isn’t
worth a damn.”
- John Unsworth
“Documenting the Reinvention of Text: The Importance of Failure”
Challenges
Or
Failures
Analysis
New Directions
Adaptability
Challenges + Failures should be constantly or consistently communicated.
Analysis + New Directions should lead to research and communication with similar projects.
Adaptability should allow for new possibilities, new questions.

Editor's Notes

  • #4: eMOP (Early Modern OCR Project) was funded by the Mellon Foundation for 734,000 over two years; our initial goals were the following
  • #26: This influenced us a lot when we were discussing putting together this paper and presentation, as we’ve all come to this international, interdisciplinary grant project from similar projects. We’ve faced challenges to our initial premises and we’ve not met milestones in the grant, yet we’ve produced interesting results and raised new research questions