Open Mining of the Bioscience Literature
Peter Murray-Rust,
ContentMine.org and the University of Cambridge
UNAM, MX 2015-10-09
Millions of data points are hidden in the bioscience literature.
ContentMine has Open technology to liberate them automatically.
Using OpenNotebook approaches
The major problem is politico-legal
This is an exploratory talk, looking for ideas and projects
The future depends on young people
Oxford 2013
Berlin 2014
Delhi 2014
Jenny Molloy with mascot AMI
Panton Authors and Fellows
Some particularly relevant Fellows/Alumni and projects:
• Rufus Pollock: Open Knowledge Foundation
• Mark Surman: Mozilla
• Dan Whaley: Hypothes.is
• Daniel Lombraña González: PyBossa/Crowdcrafting
Erin McKiernan, 2015 Flash Award
ContentMine and Peter Murray-Rust are funded by:
The Right to Read is the Right to Mine
http://contentmine.org
ContentMine Workshops and Hackdays
Open Science Brazil, 2014-08
Easily distributed software
Get started in 30 mins
Build application in a morning
Start simple: bagOfWords, Stemming, Regex, templates
Typical scientific paper
Why do we publish science?
• Communicate our results
• Archival
• Get feedback from peers.
• Provide material that others can re-use.
• Priority and esteem.
ContentMining in Neuroscience
http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html
[Liberian Ministry of Health] were stunned recently when we stumbled across
an article by European researchers in Annals of Virology [1982]: “The results
seem to indicate that Liberia has to be included in the Ebola virus endemic
zone.” In the future, the authors asserted, “medical personnel in Liberian health
centers should be aware of the possibility that they may come across active
cases and thus be prepared to avoid nosocomial epidemics,” referring to
hospital-acquired infection.
Adage in public health: “The road to inaction is paved with research
papers.”
Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)
A System Failure of Scholarly Publishing
Re-use
You cannot assume how others will want to re-use your work.
PM-R’s “first real paper”, doing science by re-using the results of others in a novel way
1974:
Each point is a separate paper!
Needing 1-4 hours in library: discovery, hardcopy delivery, transcription, hand calculation.
1976-9:
PMR and WDSM developed software and protocols to search and analyze the Cambridge Crystallographic DB
We need machines to read the literature
Output of scholarly publishing
586,364 Crossref DOIs 201,507 [1] per month
1.5 million (papers + supplemental data) /year [citation needed]*
each 3 mm thick
4500 m high per year [2]
* Most is not publicly readable
[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg
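The stack height quoted on this slide follows directly from its two estimates. A quick check, assuming the slide's own figures of 1.5 million items per year and 3 mm per printed item (the function name is only illustrative):

```python
# Both inputs are the slide's rough estimates, not measured values.
ITEMS_PER_YEAR = 1_500_000   # papers + supplemental data
THICKNESS_MM = 3             # assumed thickness of one printed item

def stack_height_m(items: int, thickness_mm: float) -> float:
    """Height in metres of a stack of `items` printed items."""
    return items * thickness_mm / 1000

print(stack_height_m(ITEMS_PER_YEAR, THICKNESS_MM))  # 4500.0
```

That is a stack of paper a little lower than Mont Blanc (about 4800 m), added every year.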
Scientific and Medical publication (STM)[+]
• World Citizens pay $450,000,000,000…
• … for research in 1,500,000 articles …
• … cost $300,000 each to create …
• … $7000 each to “publish” [*]…
• … $10,000,000,000 from academic libraries …
• … to “publishers” who forbid access to 99.9% of citizens of
the world …
• 85% of medical research is wasted (not published, badly
conceived, duplicated, …) [Lancet 2009]
[+] Figures probably ±50%
[*] arXiv preprint server costs $7 USD per paper
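The figures on this slide are mutually consistent, which is easy to confirm. A small sketch using only the slide's own numbers (all of them stated to be uncertain by about 50%):

```python
# All inputs are the slide's rough estimates.
ARTICLES_PER_YEAR = 1_500_000
RESEARCH_COST_PER_ARTICLE = 300_000   # USD to create the research
PUBLISH_COST_PER_ARTICLE = 7_000      # USD to "publish"
ARXIV_COST_PER_PAPER = 7              # USD per paper on the preprint server

research_total = ARTICLES_PER_YEAR * RESEARCH_COST_PER_ARTICLE
publish_total = ARTICLES_PER_YEAR * PUBLISH_COST_PER_ARTICLE
ratio = PUBLISH_COST_PER_ARTICLE // ARXIV_COST_PER_PAPER

print(f"research:   ${research_total:,}")   # $450,000,000,000
print(f"publishing: ${publish_total:,}")    # $10,500,000,000
print(f"publisher cost is {ratio}x the arXiv cost per paper")  # 1000x
```

The publishing total of about $10.5 billion matches the slide's round figure of $10,000,000,000 from academic libraries.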
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
ContentMine approaches
0. Open software, Open content, Open notebooks
1. Daily liberation of facts which are easy and widely useful.
– Species (Bacillus subtilis, Okapia johnstoni)
– Genes (BRCA1*, APOE)
– Chemicals (acetone, CH3OH)
– Identifiers (RRIDs, museum specimens)
2. CMunities of practice with bespoke tools:
– Clinical Trials
– Phylogenetic trees
– Systematic reviews
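The "easy and widely useful" facts listed above can often be found with very simple rules. A minimal sketch, not the ContentMine implementation: the patterns, the toy gene dictionary and the function name `extract_facts` are all assumptions made for illustration.

```python
import re

# Illustrative patterns only. The species pattern over-generates on ordinary
# prose (any capitalised word followed by a lower-case word), so a real tagger
# would also check candidates against a dictionary of known names.
SPECIES = re.compile(r"\b[A-Z][a-z]+ [a-z]{3,}\b")            # e.g. "Bacillus subtilis"
RRID = re.compile(r"\bRRID:\s?([A-Za-z]+_[A-Za-z0-9_-]+)")    # e.g. "RRID:AB_2298772"
GENE_SYMBOLS = {"BRCA1", "APOE"}                              # toy dictionary (assumption)

def extract_facts(text: str) -> dict:
    """Return candidate species, gene symbols and RRIDs found in `text`."""
    upper_tokens = re.findall(r"\b[A-Z][A-Z0-9]+\b", text)
    return {
        "species": SPECIES.findall(text),
        "genes": [t for t in upper_tokens if t in GENE_SYMBOLS],
        "rrids": RRID.findall(text),
    }

sample = ("Okapia johnstoni and Bacillus subtilis were compared; "
          "BRCA1 and APOE were sequenced (RRID:AB_2298772).")
print(extract_facts(sample))
```

Genes are matched against a dictionary rather than a bare pattern because many upper-case tokens (abbreviations, identifiers) are not gene symbols.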
http://chemicaltagger.ch.cam.ac.uk/
Typical chemical synthesis
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B EUR.
C) What’s the problem with this spectrum?
Org. Lett., 2011, 13 (15), pp 4084–4087
Original thanks to ChemBark
After AMI2 processing…..
… AMI2 has detected a square
ContentMine pipeline (diagram labels)
Sources: catalogue; getpapers (query); Daily Crawl of EuPMC, arXiv, CORE, HAL, (UNIV repos); ToC services; Publisher Sites (PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …)
Inputs: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS; URLs, DOIs
crawl: quickscrape (scrapers, queries)
norma: Normalizer, Structurer, Semantic Tagger (taggers); Text, Data, Figures
Semantic ScholarlyHTML: abstract, methods, references, Captioned Figures (Fig. 1), HTML tables
ami: search, Lookup; UNIV Repos; COMMUNITY plugins: Chem, Phylo, Trials, Crystal, Plants
DAILY CONTENT MINING: 30,000 pages/day
Facts; Visualization and Analysis
http://opentrials.net/
ContentMine will work with OpenTrials
“adult nonpregnant patients, aged ≥18 years”,
“randomization sequence using a permuted block design with random
block sizes stratified by study center”.
“blinding of the patients and caregivers is not possible”.
“Investigators performing analysis are blinded for the intervention”.
“Continuous normally distributed variables … mean and standard deviation,
counts (n) and percentages (%). … Student’s t-test … or the Mann–Whitney U test
… Categorical … Chi-square test or Fisher's exact tests. Statistical significance is
considered to be at a P value <0.05 …”
Formulaic language in reporting clinical trials
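Because this language is so formulaic, a handful of patterns already recovers structured fields from it. A minimal sketch run on phrases quoted on this slide; the patterns, field names and the function `extract_trial_fields` are illustrative assumptions, not an existing ContentMine plugin.

```python
import re

# Illustrative patterns written for the phrases quoted on the slide.
AGE = re.compile(r"aged\s*(≥|>=|>)\s*(\d+)\s*years")
P_THRESHOLD = re.compile(r"P\s*value\s*(<|≤|<=)\s*(0?\.\d+)", re.IGNORECASE)
TESTS = re.compile(
    r"Student[’']s t-test|Mann[–-]Whitney U test|Chi-square test|Fisher[’']s exact tests?"
)

def extract_trial_fields(text: str) -> dict:
    """Pull the minimum age, significance threshold and named tests from `text`."""
    age = AGE.search(text)
    p = P_THRESHOLD.search(text)
    return {
        "min_age_years": int(age.group(2)) if age else None,
        "p_threshold": float(p.group(2)) if p else None,
        "tests": TESTS.findall(text),
    }

SAMPLE = (
    "adult nonpregnant patients, aged ≥18 years. "
    "Student’s t-test … or the Mann–Whitney U test … Categorical … "
    "Chi-square test or Fisher's exact tests. Statistical significance is "
    "considered to be at a P value <0.05"
)
print(extract_trial_fields(SAMPLE))
```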
Text-based plugins
• Bag of words (https://en.wikipedia.org/wiki/Bag-of-words_model)
• Tf–idf: term frequency, inverse document frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
• Templates and regexes (regular expressions).
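The first two approaches need only a few lines of code. A minimal sketch using the Python standard library and the plain textbook weighting (tf = count / document length, idf = log(N / document frequency)); the function names and the three toy documents are assumptions for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lower-cased word counts; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def tf_idf(docs: list) -> list:
    """Return one {term: score} dict per document."""
    bags = [bag_of_words(d) for d in docs]
    n_docs = len(bags)
    # Number of documents each term occurs in (each bag contributes a term once).
    doc_freq = Counter(term for bag in bags for term in bag)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append(
            {t: (c / total) * math.log(n_docs / doc_freq[t]) for t, c in bag.items()}
        )
    return scores

docs = ["patients were randomized", "patients were blinded", "mice were randomized"]
scores = tf_idf(docs)
print(scores[2])
```

A word that occurs in every document ("were") scores zero, while a word unique to one document ("mice") scores highest, which is why tf-idf surfaces the terms that distinguish one paper from the rest.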
“Bag of Words”
Three fulltext articles from trialsjournal.com
Regular Expressions for Systematic Reviews of Animal Tests
Preceding Text
Following Text
Extracted term
In 30 minutes, 6 scientists (most unfamiliar with regex) wrote 200 regexes for ARRIVE (the NC3Rs guidelines)
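The three columns shown (preceding text, extracted term, following text) are a keyword-in-context view of each regex match. A minimal sketch; the pattern is a made-up example of a randomisation check and is not taken from the regexes written at the workshop.

```python
import re

def find_with_context(pattern: str, text: str, width: int = 30) -> list:
    """Return (preceding, term, following) for each regex match in `text`."""
    rows = []
    for m in re.finditer(pattern, text):
        preceding = text[max(0, m.start() - width):m.start()]
        following = text[m.end():m.end() + width]
        rows.append((preceding, m.group(0), following))
    return rows

# Hypothetical pattern for one reporting item (was allocation randomised?).
RANDOMISATION = r"randomi[sz]ed|randomly (?:assigned|allocated)"

rows = find_with_context(RANDOMISATION, "Mice were randomly assigned to two groups.", width=10)
print(rows)
```

Showing the surrounding text lets a reviewer judge each hit quickly, which matters when the regexes are written by people new to the technique.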
TEMPLATES
https://en.wikipedia.org/wiki/Consolidated_Standards_of_Reporting_Trials
Some communities have standard reporting, which helps extraction
[Plot: Ln bacterial load per fly (y axis) against days post-infection (x axis, 0 to 5)]
Bitmap Image and Tesseract OCR
UNITS, TICKS, QUANTITY, SCALE, TITLES
DATA!! 2000+ points
Dumb PDF, CSV, Semantic Spectrum
2nd Derivative, Smoothing, Gaussian Filter
Automatic extraction
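Once the ticks and their values have been read, turning pixel positions into data values is a linear calibration per axis. A minimal sketch assuming linear axes; the pixel coordinates and the function name are invented for illustration, and the tick values are chosen to resemble the bacterial-load plot.

```python
def make_axis_mapper(pixel_a: float, value_a: float, pixel_b: float, value_b: float):
    """Linear map from a pixel coordinate to a data value, built from two known ticks."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale

# Hypothetical tick positions: x ticks "0" and "5" at pixels 100 and 600;
# y ticks "9.0" and "11.5" at pixels 400 and 150 (pixel y grows downwards).
x_of = make_axis_mapper(100, 0.0, 600, 5.0)
y_of = make_axis_mapper(400, 9.0, 150, 11.5)

trace_pixels = [(100, 400), (350, 275), (600, 150)]   # invented trace points
points = [(x_of(px), y_of(py)) for px, py in trace_pixels]
print(points)
```

The same two-tick calibration works for any linear axis; a logarithmic axis would need the mapping applied to the logarithm of the tick values instead.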
PLUTo
Aves
Apterygidae
Marsupialia
Monotremata
Mammalia
Reptilia
Amphibia
Arthropoda
Myriapoda
Okapia johnstoni
Pyrus
Stuffed Tree of Life
https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction
Thinning Topology
Serialization
Newick
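The last step, serialization, writes the recovered tree topology out as a Newick string. A minimal sketch, assuming the tree has already been recovered as nested `(name, length, children)` tuples; that representation is an assumption for illustration, and the three taxa and branch lengths are taken from the tree shown later in this talk.

```python
def to_newick(node) -> str:
    """Serialise a nested tree to Newick. A node is (name, length, children)."""
    name, length, children = node
    label = name or ""
    if children:
        # Internal node: children in parentheses, then the (often empty) name.
        label = "(" + ",".join(to_newick(child) for child in children) + ")" + label
    if length is not None:
        label += f":{length}"
    return label

tree = ("", None, [
    ("", 86, [
        ("Pyramidobacter_piscolens", 195, []),
        ("Jonquetella_anthropi", 135, []),
    ]),
    ("Synergistes_jonesii", 301, []),
])
newick = to_newick(tree) + ";"
print(newick)
# ((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301);
```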
PMR’s Tribute
Planned Memorial Meeting
July 14th 2014 Cambridge
OPEN NOTEBOOK SCIENCE
Traditional Research and Publication
“Lab” work
paper/thesis
Write
rewrite
Re-experiment
publish
???
Validation??
DATA
output “belongs”
to publisher
TOOLS
Open Notebook Science
Open
engineered
repository
World
community
INSTRUMENT
validate
merge
MODEL
CODE
DATA
DATA
knowledge
calibrate
Problems are solved communally; nothing is needlessly duplicated; “publication” is continuous; data are SEMANTIC
Machines and humans working together
Open Notebook Content Mining
• “No insider knowledge”
• Anyone can become involved
• All raw non-copyright material on GitHub
• Planning and discussion on Open Discourse
• All output (however imperfect) on GitHub CC0
• Immediate upload
• Inspired by Free/Libre/Open Source, Wikipedia, OpenStreetMap.
4300 images
“Root”
OCR (Tesseract)
Norma (image analysis)
(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);
Semantic re-usable/computable output (ca 4 secs/image)
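"Computable" here means that any program can read the Newick string back into a tree. A minimal sketch of a recursive-descent parser; it handles plain names and branch lengths only (no quoted labels or comments), and the input is a complete sub-branch of the tree above rather than the elided full string.

```python
import re

TOKEN = re.compile(r"[(),;]|[^(),;]+")
PUNCTUATION = {"(", ")", ",", ";"}

def parse_newick(text: str) -> dict:
    """Parse a simple Newick string into nested {name, length, children} dicts."""
    tokens = TOKEN.findall(text.strip())
    pos = 0

    def parse_node() -> dict:
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1                                  # consume "("
            children.append(parse_node())
            while tokens[pos] == ",":
                pos += 1                              # consume ","
                children.append(parse_node())
            pos += 1                                  # consume ")"
        name, length = "", None
        if pos < len(tokens) and tokens[pos] not in PUNCTUATION:
            name, _, raw = tokens[pos].partition(":")  # "name:length"
            length = float(raw) if raw else None
            pos += 1
        return {"name": name, "length": length, "children": children}

    return parse_node()

def leaves(node: dict) -> list:
    """Names of all leaf nodes, left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaves(child)]

tree = parse_newick(
    "((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301);"
)
print(leaves(tree))
```

With the species names available as a list, they can be counted across papers or looked up in external resources.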
Automatic Open Notebook of computations
Everything is posted to Github before being analyzed
Bacillus subtilis [131238]*
Bacteroides fragilis [221817]
Brevibacillus brevis
Cyclobacterium marinum
Escherichia coli [25419]
Filobacillus milosensis
Flectobacillus major [15809775]
Flexibacter flexilis [15809789]
Formosa algae
Gelidibacter algens [16982233]
Halobacillus halophilus
Lentibacillus salicampi [18345921]
Octadecabacter arcticus
Psychroflexus torquis [16988834]
Pseudomonas aeruginosa [31856]
Sagittula stellata [16992371]
Salegentibacter salegens
Sphingobacterium spiritivorum
Terrabacter tumescens
• [Identifier in Wikidata]
• Missing = not found with Wikidata API
20 commonest organisms (in > 30 papers) in trees from IJSEM*
Half do not appear to be in Wikidata
Can the Wikipedia Scientists comment?
*Int. J. Syst. Evol. Microbiol.
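A lookup of this kind can be sketched against the public Wikidata API, whose `wbsearchentities` action searches for entities by label. This is a minimal sketch and not necessarily how the table above was produced: the helper names and the User-Agent string are assumptions, and the network call is defined but not executed here.

```python
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"

def search_url(name: str, language: str = "en") -> str:
    """Build a wbsearchentities query URL for a species name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)

def first_id(response: dict):
    """Return the first entity id in a search response, or None if nothing was found."""
    hits = response.get("search", [])
    return hits[0]["id"] if hits else None

def lookup(name: str):
    """Query Wikidata over the network (not called in this sketch)."""
    request = urllib.request.Request(
        search_url(name),
        headers={"User-Agent": "contentmine-sketch/0.1"},  # hypothetical client name
    )
    with urllib.request.urlopen(request) as resp:
        return first_id(json.load(resp))

print(search_url("Escherichia coli"))
```

A label search returns the best text match, so a missing result can mean either that the organism has no entry or that its entry uses a different label; a hit still needs checking that it is the taxon and not a homonym.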
Supertree for 924 species
Tree
Supertree created from 4300 papers
Minor branch
Part of major branch
Part of major branch
Ideas for Neuroscience
Can we extract digital information from
published electroneurophysiology traces?...
…and build super-information?
Raw trace (pixels)
Thinned trace (pixels)
Line segments (SVG)
Reconstructed trace (SVG)
Extraction into data format (CSV, Excel)
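The final step, from SVG line segments to a data file, is mostly format conversion. A minimal sketch assuming the reconstructed trace is stored as an SVG `<polyline>` whose `points` attribute uses space-separated `x,y` pairs; the sample SVG and function names are invented for illustration, and the values are still in pixel units until an axis calibration is applied.

```python
import csv
import io
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def polyline_points(svg_text: str):
    """Yield (x, y) pairs from every <polyline> element in an SVG document."""
    root = ET.fromstring(svg_text)
    for polyline in root.iter(SVG_NS + "polyline"):
        # Assumes "x,y x,y ..." formatting; SVG also permits other separators.
        for pair in polyline.get("points", "").split():
            x, y = pair.split(",")
            yield float(x), float(y)

def to_csv(points, header=("x", "y")) -> str:
    """Write points as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(points)
    return out.getvalue()

svg = ('<svg xmlns="http://www.w3.org/2000/svg">'
       '<polyline points="0,10 5,12.5 10,8"/></svg>')
csv_text = to_csv(polyline_points(svg))
print(csv_text)
```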
Peter Murray-Rust
BMC publisher
Blue Obelisk paper (20 co-authors)
Sub-network
From CATalog
Phytochemistry extraction
O. dayi
“volatile composition of”
A. sibeiri
A. judaica
Displayed by CAT (CottageLabs)
The problem
©
Prof. Ian Hargreaves (2011): “David Cameron's exam question”: “Could it be true that laws designed more than three centuries ago with the express purpose of creating economic incentives for innovation by protecting creators' rights are today obstructing innovation and economic growth?”
“Yes. We have found that the UK's intellectual property framework, especially with regard to copyright, is falling behind what is needed.”
"Digital Opportunity" by Prof Ian Hargreaves - http://www.ipo.gov.uk/ipreview.htm. Licensed under CC BY 3.0 via Wikipedia - https://en.wikipedia.org/wiki/File:Digital_Opportunity.jpg#/media/File:Digital_Opportunity.jpg
Elsevier wants to control Open Data
[asked by Michelle Brook]
http://www.epip2015.org/copyright-wars-frozen-conflict/
UPDATE 20150902: Ian Hargreaves "the voices of the digital many should not be
drowned out by the digital self-interested few"
contentmine.org team


Editor's Notes

  • #2: Hi, I’m here to talk about AMI, a data extraction framework and tool. First, I just want to highlight some of the key contributors to the project: Andy for his work on the ChemistryVisitor and Peter for the overall architecture. In this talk, I’m going to stress the importance of data in a specific format and its utility for automated machine processing. Then I’m going to demonstrate AMI’s architecture and the transformation of data as it flows through the process. I’m going to dwell a little on a core format used, Scalable Vector Graphics (SVG), before introducing the concept of visitors, which are pluggable, context-specific data extractors. Next, I’m going to introduce Andy’s ChemVisitor, for extracting semantic chemistry data, along with a few other visitors that can process non-chemistry-specific data. Finally, I will demonstrate some uses of the ChemVisitor within the realm of validation and metabolism.
  • #23: ChemBark