Mining Bioscience Literature
Peter Murray-Rust,
University of Cambridge and TheContentMine
SynBio, Cambridge UK, 2015-05-18
Much Scientific Data lies hidden in text and images, in articles, theses,
reports, patents, lab-books…
The ContentMine has Open collaborative tools that anyone can use to
find facts and re-use them in their own research
http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html
We were stunned recently when we stumbled across an article by European
researchers in Annals of Virology [1982]: “The results seem to indicate that
Liberia has to be included in the Ebola virus endemic zone.” In the future,
the authors asserted, “medical personnel in Liberian health centers should be
aware of the possibility that they may come across active cases and thus be
prepared to avoid nosocomial epidemics,” referring to hospital-acquired
infection.
Adage in public health: “The road to inaction is paved with research
papers.”
Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)
A System Failure of Scholarly Publishing
Scientific and Medical publication (STM)[+]
• World Citizens pay $400,000,000,000…
• … for research in 1,500,000 articles …
• … cost $300,000 each to create …
• … $7000 each to “publish” [*]…
• … $10,000,000,000 from academic libraries …
• … to “publishers” who forbid access to 99.9% of citizens of
the world …
• 85% of medical research is wasted (not published, badly
conceived, duplicated, …)
[+] Figures probably ±50%
[*] arXiv preprint server costs $7 USD per paper
The Right to Read is the Right to Mine
http://contentmine.org
Facts Marked by “non-scientists” in ContentMine workshops
With Wikipedia everyone can be a scientist
ContentMine Workshops and
Hackdays
Open Science Brazil, 2014-08
Easily distributed software
Get started in 30 mins
Build application
in a morning
Start simple: bagOfWords, Stemming, Regex, templates
Oxford 2013
Berlin 2014
Delhi 2014
Jenny Molloy with mascot AMI
OUR TEAM
• Jenny Molloy @jenny_molloy
• Ross Mounce @rmounce
• Richard Smith-Unna @blahah404
• Stephanie Smith-Unna @treblesteph
• Mark MacGillivray @cottagelabs
• Peter Murray-Rust @petermurrayrust
• Charles Oppenheim @CharlesOppenh
• Graham Steel @McDawg
Workshops
(1-hour -> full day or more)
2014-May->Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Washington DC, US
• JISC, London
2015
• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO
Collaborators
• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE
• EuropePubmedCentral
Content-Mining (TDM*)
• Now COMPLETELY LEGAL IN UK since 2014-06-01
(“Hargreaves”)…
• … Whatever the publishers tell you. Do NOT sign
their APIs
• UK can legally IGNORE contractual restrictions
• Movement to extend this to Europe (Julia Reda,
MEP proposal)
• And STM publishers are spending millions to stop
us
*Text and Data Mining
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
[Figure labels: SECTIONS, MAPS, TABLES, CHEMISTRY, TEXT, MATH]
contentmine.org tackles these
“nuggets” in a scientific paper
[Figure labels: quantity, units, value ranges, chemical, project places]
Humans aren’t designed to mine this …
What is “Content”?
Emily Sena (neuroscience.ed.ac.uk) spends
half a day digitising a diagram like this
ContentMine will soon be able to do it in 1 second
• CRAWL the web for scientific documents
(articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize page to semantic form
…Open semantic science …
• MINE pages with your methods and tools (AMI)
• CAT-alogue results in searchable index
• Automate daily process (CANARY)
contentmine.org Infrastructure
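The stages listed above can be read as a chain of small functions, each feeding the next. The following is only a minimal sketch under stated assumptions: the function names (`crawl`, `scrape`, `normalize`, `mine`, `catalogue`), the in-memory `FAKE_WEB` and the species regex are hypothetical illustrations, not the actual quickscrape, Norma, AMI or CAT interfaces.

```python
import re
from collections import defaultdict

# Toy stand-in for the web: URL -> raw page. A real crawler would fetch these.
FAKE_WEB = {
    "http://example.org/a1": "<p>Oil of  Origanum dayi was analysed.</p>",
    "http://example.org/a2": "<p>Artemisia judaica and Origanum dayi compared.</p>",
}

def crawl():
    """CRAWL: yield the URLs of candidate documents."""
    yield from sorted(FAKE_WEB)

def scrape(url):
    """SCRAPE: return the raw content behind one URL."""
    return FAKE_WEB[url]

def normalize(raw):
    """NORMALIZE: strip markup and collapse whitespace into plain text."""
    text = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", text).strip()

def mine(text, pattern):
    """MINE: apply one pluggable method (here a regex) and return its matches."""
    return pattern.findall(text)

def catalogue(facts_by_url):
    """CATALOGUE: invert url -> facts into a searchable fact -> urls index."""
    index = defaultdict(list)
    for url, facts in facts_by_url.items():
        for fact in facts:
            index[fact].append(url)
    return dict(index)

def run_pipeline(pattern):
    """Run every stage in order over all crawled documents."""
    facts = {url: mine(normalize(scrape(url)), pattern) for url in crawl()}
    return catalogue(facts)

# Hypothetical miner: capitalised genus followed by a lower-case species epithet.
BINOMIAL = re.compile(r"\b[A-Z][a-z]+ [a-z]{3,}\b")
index = run_pipeline(BINOMIAL)
```

Keeping each stage as a separate function is what allows a different miner (chemistry, sequences, clinical trials) to be plugged in without touching the crawl or scrape steps.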
https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0
Daily Stream of 100,000 Open Facts
Twitter? Indexed by CAT
http://catalogue.cottagelabs.com/browse
http://catalogue.cottagelabs.com/graph
[Diagram: CANARY pipeline, from scientific literature and repositories to the CAT-alogue index]
Inputs: Crawl, Feed, URL, DOI
quickscrape: bespoke scrapers, XPath, per-journal taggers, per-journal metadata
Norma (Index & Transform): PDF, XML, DOC, CSV, BadHTML, OCR, diagrams -> sHTML
AMI plugins: Regex, Sequences, Species, Chemistry, Phylogenetics, Farming
Output: Open NORMA-lized Scientific Literature + Facts
[Diagram: Daily Crawl of 2000-5000 articles. Sources: PLoSONE, BMC 1, BMC 2, Hybrid, Closed1, Closed2, CORE Repository UK. Steps: Crawl … Scrape … Normalize … Mine. Results: FACTS, enhanced annotated articles and semantic scientific objects in the CATalog, published as Linked OpenData]
[Diagram: selected retrospective crawl. Adds Open3, Closed3 and a REPO of articles, theses, reports and patents; FACTS again go to the CATalog]
[Figure: sub-network from the CATalog. Nodes: Peter Murray-Rust, BMC (publisher), Blue Obelisk paper (20 co-authors)]
[Figure: phytochemistry extraction. The phrase “volatile composition of” linked to O. dayi, A. sibeiri and A. judaica]
Displayed by CAT (CottageLabs)
Retrieval/Extraction Technologies
• Bag Of Words (https://en.wikipedia.org/wiki/Bag-of-words_model)
• Term-Frequency Inverse-Document-Frequency
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
• Regular Expressions
• Templates (Information Extraction)
• Natural Language Processing (NLP)
• Image processing and mining
• Lookup (Wikidata, Bioscience databases)
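The first two techniques in the list can be shown in a few lines. This is a minimal sketch, assuming the simplest textbook definitions (raw term frequency, unsmoothed logarithmic inverse document frequency); the three sample sentences are invented for illustration and the function names are not part of any ContentMine tool.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split on non-letters, and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def tf_idf(docs):
    """Return one {word: score} dict per document.

    tf  = count of the word in the document / total words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    bags = [bag_of_words(d) for d in docs]
    # Iterating a Counter yields each distinct word once per document.
    doc_freq = Counter(word for bag in bags for word in bag)
    n_docs = len(docs)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        })
    return scores

# Invented sample documents.
docs = [
    "the virus was found in the sample",
    "the plant oil was analysed",
    "the virus spread",
]
scores = tf_idf(docs)
```

A word such as "the", which occurs in every document, scores zero, while a word confined to one document scores highest there. That is why TF-IDF tends to surface the terms that characterise a paper.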
Bag of Words
Theses from HAL repository
Species
Regex for Clinical Trials
CLINICAL TRIALS
How do we find (mentions of) clinical trials?
Is a document a (clinical) trial?
What is the subject of the trial?
What is the methodology used? How many/long?
Does the design and practice conform to CONSORT?
What are the outcomes?
Can we extract specific re-usable information?
Who are involved? (researchers, sponsors, patients?)
Has a proposed trial been completed and reported?
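The first two questions can be approached with regular expressions. The slides do not show the expressions actually used, so the following is a hedged sketch: it relies on the registry identifier formats "NCT" plus eight digits (ClinicalTrials.gov) and "ISRCTN" plus eight digits, and on a short, hypothetical list of design terms. The identifiers in the sample text are made-up placeholders.

```python
import re

# Registry identifiers: NCT or ISRCTN followed by exactly eight digits.
TRIAL_ID = re.compile(r"\b(NCT\d{8}|ISRCTN\d{8})\b")

# A few design terms that suggest a document reports a trial (illustrative only).
TRIAL_PHRASE = re.compile(
    r"\b(?:randomi[sz]ed|double[- ]blind|placebo[- ]controlled)\b",
    re.IGNORECASE,
)

def find_trial_mentions(text):
    """Return the distinct trial identifiers and design terms found in the text."""
    return {
        "ids": sorted(set(TRIAL_ID.findall(text))),
        "design_terms": sorted({m.lower() for m in TRIAL_PHRASE.findall(text)}),
    }

# Placeholder text; the identifiers do not refer to real trials.
sample = (
    "This randomised, double-blind trial (NCT01234567) followed an "
    "earlier study registered as ISRCTN12345678. See also NCT01234567."
)
result = find_trial_mentions(sample)
```

Identifiers answer "which trial is mentioned?" and can be looked up in a registry; the design terms are only weak evidence for "is this document a trial?" and would need the later questions (CONSORT, outcomes) answered by other methods.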
Natural Language Processing
Part of speech tagging (Wordnet, Brown Corpus, etc.)
Parsing chemical sentences
This could be extended to much other scientific language
http://chemicaltagger.ch.cam.ac.uk/
Typical chemical synthesis
Automatic semantic markup of chemistry
Could be used for analytical, crystallization, etc.
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B Eur.
[Figure: plot of Ln bacterial load per fly (y axis, ticks from 6.0 to 11.5) against days post-infection (x axis, 0 to 5)]
Bitmap Image and Tesseract OCR
[Figure labels recovered from the plot: UNITS, TICKS, QUANTITY, SCALE, TITLES, DATA!! (2000+ points)]
Dumb PDF -> CSV
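Once OCR has read the tick labels, turning plotted points into data is a matter of calibrating each axis. The sketch below assumes linear axes and two known ticks per axis; the pixel positions are hypothetical, while the axis ranges (0 to 5 days, 6.0 to 11.5) are taken from the plot on the slide. It illustrates the idea only and is not AMI's implementation.

```python
import csv
import io

def make_axis_mapper(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a data value,
    assuming a linear axis calibrated by two tick marks."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale

def digitise(points_px, x_ticks, y_ticks):
    """Convert (x, y) pixel positions to data coordinates."""
    to_x = make_axis_mapper(*x_ticks)
    to_y = make_axis_mapper(*y_ticks)
    return [(to_x(px), to_y(py)) for px, py in points_px]

def to_csv(rows, header):
    """Serialise the digitised points as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical calibration: x tick "0" at pixel 100 and "5" at pixel 600;
# y tick "6.0" at pixel 500 and "11.5" at pixel 60 (pixel rows grow downwards).
x_ticks = (100, 0.0, 600, 5.0)
y_ticks = (500, 6.0, 60, 11.5)
points = digitise([(100, 500), (350, 280), (600, 60)], x_ticks, y_ticks)
table = to_csv(points, ["days_post_infection", "ln_bacterial_load"])
```

The y mapper has a negative scale because image rows count downwards; calibrating from two ticks handles that without special cases.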
Semantic Spectrum: 2nd derivative, smoothing, Gaussian filter, automatic extraction
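The slide names the ingredients (Gaussian filter, smoothing, second derivative) without giving details. One common way to combine them, sketched here as an assumption rather than as the method actually used, is to smooth the trace with a Gaussian kernel and report peaks where the second derivative has a strongly negative local minimum. The kernel width, threshold and synthetic spectrum are all illustrative.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, sigma=1.0, radius=3):
    """Convolve with a Gaussian; edges are handled by repeating the end values."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def second_derivative(signal):
    """Central finite difference; the two end points are set to 0."""
    d2 = [0.0] * len(signal)
    for i in range(1, len(signal) - 1):
        d2[i] = signal[i - 1] - 2 * signal[i] + signal[i + 1]
    return d2

def find_peaks(signal, sigma=1.0, threshold=-0.01):
    """Indices where the smoothed second derivative has a local minimum
    below the (negative) threshold."""
    d2 = second_derivative(smooth(signal, sigma))
    return [i for i in range(1, len(d2) - 1)
            if d2[i] < threshold and d2[i] <= d2[i - 1] and d2[i] <= d2[i + 1]]

# Synthetic spectrum: two Gaussian peaks centred at indices 20 and 50.
spectrum = [math.exp(-((i - 20) ** 2) / 18) + 0.6 * math.exp(-((i - 50) ** 2) / 18)
            for i in range(80)]
peaks = find_peaks(spectrum)
```

Smoothing first matters because differentiation amplifies noise; on a real scanned spectrum the threshold would need tuning to the noise level.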
[Figure: phylogenetic tree image processed by AMI into NexML and HTML]
Family: Acanthisittidae, Acanthizidae, Acrocephalidae, Callaeidae, Campephagidae, Cnemophilidae, Corvidae
Genus: Acanthisitta, Acrocephalus, Ailuroedus, Amytornis, Camptostoma
Posterior probability: 0.84, 0.91, 0.93, 0.95
AMI can MEASURE branch lengths! 23.12, 34.54, 37.21, 38.55
https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction
Thinning Topology
Serialization
Newick
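The last step, serialization to Newick, is easy to illustrate once the topology has been recovered. This is a minimal sketch assuming the tree is held as nested dictionaries; that representation, the topology and the branch lengths are invented for illustration (only the genus names come from the slide), and it does not reproduce AMI's own data model.

```python
def to_newick(node):
    """Serialise a nested tree to Newick.

    A node is a dict with optional keys:
      "name" (str), "length" (branch length to parent), "children" (list of nodes).
    """
    def fmt(n):
        children = n.get("children", [])
        # Internal nodes wrap their comma-separated children in parentheses.
        text = "(" + ",".join(fmt(c) for c in children) + ")" if children else ""
        text += n.get("name", "")
        if "length" in n:
            text += ":" + format(n["length"], "g")
        return text
    return fmt(node) + ";"

# Illustrative tree: topology and branch lengths are made up.
tree = {
    "children": [
        {"name": "Acanthisitta", "length": 1.2},
        {
            "length": 0.5,
            "children": [
                {"name": "Acrocephalus", "length": 0.3},
                {"name": "Amytornis", "length": 0.4},
            ],
        },
    ]
}
newick = to_newick(tree)
# newick == "(Acanthisitta:1.2,(Acrocephalus:0.3,Amytornis:0.4):0.5);"
```

Newick is a compact exchange format that most phylogenetics software reads, which is what makes a tree mined from an image re-usable.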
Problems
• Cannot do handwriting
• Scanned documents give poorer results
• The older the document the poorer the result
• Tables are a major problem
• Always try to get the original document
• XML is better than Word, which is better than PDF
• Vector images >> PNG > JPEG
• Maths, chemistry are specialist
POSSIBLE USES
• Indexing/searching the literature; G***** for science
• Current awareness; alerts and practices
• Extraction and re-use of facts; re-computation
• Multidisciplinary integration; co-occurrence
• Compliance with funder/institution policies
• Managing your Research Data!
• Finding similar and complementary colleagues
• Reproducibility, checking data and avoiding fraud
ContentMine Workshops and
Hackdays
Open Science Brazil, 2014-08
Easily distributed software
Get started in 30 mins
Build application
in a morning
Start simple: bagOfWords, Stemming, Regex, templates
Join our Open-Source community at
http://www.contentmine.org

ContentMining for Synthetic Biology

Editor's Notes

  • #2: Hi, I’m here to talk about AMI, a data extraction framework and tool. First, I just want to highlight some of the key contributors to the project: Andy for his work on the ChemistryVisitor and Peter for the overall architecture. In this talk, I’m going to stress the importance of data in a specific format and its utility for automated machine processing. Then I’m going to demonstrate AMI’s architecture and the transformation of data as it flows through the process. I’m going to dwell a little on a core format used, Scalable Vector Graphics (SVG), before introducing the concept of visitors, which are pluggable, context-specific data extractors. Next, I’m going to introduce Andy’s ChemVisitor, for extracting semantic chemistry data, along with a few other visitors that can process non-chemistry data. Finally, I will demonstrate some uses of the ChemVisitor in validation and metabolism.
  • #9: Because information is structured (some examples listed), we can aggregate similar objects and mine using a modular systematic approach.
  • #27: Because information is structured (some examples listed), we can aggregate similar objects and mine using a modular systematic approach.
  • #39: Can describe each collaboration, but keep this slide brief if the presentation is short.