Content-Mining for Clinical Trials
Peter Murray-Rust
contentmine.org
Cochrane UK, Oxford, 2015-03-16
• OPEN Platform for Machines+humans
to automatically “read” the trials
literature
• Grow communities and give everyone
the tools and know-how to mine trials
09:30 - Introductions
10:00 - Overview of ContentMine
10:30 - Discussion: why might content mining clinical
trials be useful?
11:00 - Tea/coffee break
11:15 - Discussion: current tools and what is needed
12:00 - Discussion: imagining the clinical trials mining
pipeline
12:30 - Lunch
13:30 - Demo and introduction to software
14:30 - Technical session 1 (hands-on content mining)
15:30 - Tea/coffee break
15:45 - Technical session 2 (hands-on content mining)
17:00 - Event close
Background for Today
• ContentMine aims to make large areas of scientific fact OPEN (100
million facts/year)
• We’re working with the Wellcome Trust, Europe PubMed Central, etc.
• A politically “hot” area (Hargreaves legislation, EU activity)
• A week ago, a Wellcome Trust workshop on TDM and Neuroscience
reached “rough consensus” on what was needed.
• In the last few days we’ve prototyped what we think is a good
starting point…
• NOTE: The software is very “bleeding edge”! Please treat in a spirit
of adventure!!
• Vision/enthusiasm from Amy Price, Anna Noel-Storr, Emily Sena
(Edinburgh) and yourselves!
Questions we could tackle
• How do we find (mentions of) clinical trials?
• Is a document a (clinical) trial?
• What is the subject of the trial?
• What is the methodology used?
• Does the design and practice conform to CONSORT?
• What are the outcomes?
• Can we extract specific re-usable information?
• Who is involved? (researchers, sponsors, patients?)
• Has a proposed trial been completed and reported?
Afternoon session
• Work in groups; mixture of skills and
experience
• Take different sections of CONSORT
• Scrape articles from trialsjournal.com
• Explore word frequency – create your own
lists of frequent words
• Design regexes to extract CONSORT 8a->11
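A possible starting point for the regex exercise, as a minimal sketch only. The patterns below are hypothetical first guesses for CONSORT 2010 items 8a (sequence generation), 8b (type of randomisation), 9 (allocation concealment) and 11a (blinding); item 10 (implementation) is left out, and every pattern will need tuning against real articles.

```python
import re

# Hypothetical starter patterns, one per CONSORT 2010 item (not a validated set).
CONSORT_PATTERNS = {
    "8a sequence generation": re.compile(
        r"\b(computer[- ]generated|random number table|randomi[sz]ation (list|sequence))\b",
        re.I),
    "8b randomisation type": re.compile(
        r"\b(block(ed)? randomi[sz]ation|block sizes? of \d+|stratified)\b", re.I),
    "9 allocation concealment": re.compile(
        r"\b(allocation concealment|sealed,? opaque envelopes?|central(ly)? randomi[sz]ed)\b",
        re.I),
    "11a blinding": re.compile(
        r"\b(double|single|triple)[- ]blind(ed)?\b|\bblinded to\b|\bmasked to\b", re.I),
}


def tag_sentences(text):
    """Return (item, sentence) pairs for every sentence that matches a pattern."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        for item, pattern in CONSORT_PATTERNS.items():
            if pattern.search(sentence):
                hits.append((item, sentence))
    return hits


sample = ("Participants were assigned using a computer-generated randomisation list. "
          "Allocation was concealed in sealed, opaque envelopes. "
          "Outcome assessors were blinded to group assignment.")

for item, sentence in tag_sentences(sample):
    print(item, "->", sentence)
```

Working sentence by sentence keeps each hit readable for a human reviewer, which suits the group format of the session.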
The Right to Read is the Right to Mine
http://contentmine.org
https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0
Daily Stream of 100,000 Open Facts
Twitter? Indexed by CAT
ContentMining and Clinical Trials
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
What is “Content”?
Machine-Human symbioses
• Wikipedia
• OpenStreetMap
• Google
We aim to make it trivial for a human+machine
to mine the scientific literature.
By building Communities
ContentMine Workshops and
Hackdays
Open Science Brazil, 2014-08
Easily distributed software
Get started in 30 mins
Build application
in a morning
Start simple: bagOfWords, Stemming, Regex, templates
Oxford 2013
Berlin 2014
Delhi 2014
Jenny Molloy with mascot AMI
Workshops
(1-hour -> full day or more)
2014-May->Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Washington DC, US
• JISC, London
Upcoming
• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO
Collaborators
• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE
• Europe PubMed Central
• CRAWL the web for scientific documents
(articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize page to semantic form
…Open semantic science …
• MINE pages with your methods and tools (AMI)
• CAT-alogue results in searchable index
• Automate daily process (CANARY)
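The stages above can be sketched end to end in a few lines. This is an illustration under stated assumptions, not the ContentMine code: the function names are hypothetical stand-ins for quickscrape, NORMA, AMI and CAT, an in-memory dictionary replaces the crawl, and the example fact is a ClinicalTrials.gov identifier (NCT followed by eight digits).

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def scrape(raw_html):
    """SCRAPE stand-in: pull the text out of one page."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)


def normalize(text):
    """NORMA stand-in: consistent Unicode and whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def mine(text, pattern):
    """AMI stand-in: return every match of one regex 'plugin'."""
    return [m.group(0) for m in re.finditer(pattern, text)]


def catalogue(index, doc_id, facts):
    """CAT stand-in: inverted index from fact to the documents containing it."""
    for fact in facts:
        index.setdefault(fact, []).append(doc_id)
    return index


def run_pipeline(documents, pattern):
    index = {}
    for doc_id, raw_html in documents.items():
        facts = mine(normalize(scrape(raw_html)), pattern)
        catalogue(index, doc_id, facts)
    return index


# In-memory pages stand in for the CRAWL stage.
docs = {
    "doc1": "<html><body><p>Trial registered as NCT01234567.</p></body></html>",
    "doc2": "<p>See   NCT01234567 and NCT07654321.</p>",
}
index = run_pipeline(docs, r"NCT\d{8}")
print(index)
```

Keeping each stage a plain function makes it easy to swap in a different scraper or plugin without touching the rest.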
contentmine.org Infrastructure
[Diagram: Crawl / Feed -> quickscrape -> Norma (Transform) -> AMI -> CAT-alogue index, run as the CANARY pipeline]
• Sources: scientific literature, repositories (URL, DOI)
• Formats: PDF, XML, DOC, CSV, bad HTML, OCR, diagrams
• Scrapers: bespoke, XPath, per-journal
• Taggers: per-journal, metadata
• Norma output: sHTML (Open NORMA-lized Scientific Literature + Facts)
• AMI plugins: Regex, Sequences, Species, Chemistry, Phylogenetics, Farming
https://commons.wikimedia.org/wiki/File:Flickr_-_DVIDSHUB_-_RSP_Warrior_Challenge_Prepares_Soldiers_Mentally,_Physically_%281%29.jpg
CRAWLing the Literature
NO Central Table of Contents
Massive technical, political, legal opposition
Little interest from Academia
Tedious
Few general tools
The Right to Read is The Right To Mine
PMR in 2012: http://blog.okfn.org/2012/06/01/the-right-to-read-is-the-right-to-mine/
SCRAPE
https://en.wikipedia.org/wiki/Gleaning#mediaviewer/File:Millet_Gleaners.jpg PublicDomain
quickscrape* and the non-standard per-publisher site
• Formats involved: HTML, PDF, XML, PNG, SVG, CSV, DOC, LaTeX, CIF, …
*Scrapers created by Richard Smith-Unna + Community
https://en.wikipedia.org/wiki/W._Heath_Robinson#mediaviewer/File:Robinson%28WH%29-%28%27Uncle_Lubin%27%29.jpg PublicDomain
NORMA-lization of Scientific Literature
• Input: PDFs, broken HTML, PNGs for math, etc.
• NORMA output: Unicode, diacritics, well-formed, sectioned, tagged, SVG diagrams
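The Unicode and diacritics part of this step can be illustrated with the Python standard library. This is a minimal sketch, not NORMA itself; the function names are hypothetical, and sectioning, tagging and diagram conversion are not covered.

```python
import unicodedata


def normalise_text(raw):
    """Compose diacritics and expand compatibility characters such as PDF ligatures (NFKC)."""
    text = unicodedata.normalize("NFKC", raw)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def strip_diacritics(text):
    """Optional folding step for search indexes, e.g. 'Gražulis' becomes 'Grazulis'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# 'e' + combining acute accent, an 'fi' ligature, extra spaces, a non-breaking space.
raw = "e\u0301tude of \ufb01ve   patients\u00a0(Gra\u017eulis)"
clean = normalise_text(raw)
print(clean)
print(strip_diacritics(clean))
```

Folding diacritics loses information, so it is usually applied only to a search copy of the text, never to the normalized document.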
AMI-plugins
• BagOfWords, Stemming and Regular Expressions
• Species
• Biological Sequences
• Chemical compounds & reactions
• Farming * (Rory Aronson)
• Crystallography * (Saulius Grazulis, COD)
• Clinical Trials * (Amy Price)
• Phylogenetics * (Ross Mounce)
• Phytochemistry * (Chris Steinbeck, PMR)
* subcommunities
Text-based plugins
• Bag of words
(https://en.wikipedia.org/wiki/Bag-of-words_model)
• https://en.wikipedia.org/wiki/Tf%E2%80%93idf
(Term-frequency, inverse document frequency)
• Templates and regexes (regular expressions).
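To make bag of words and tf-idf concrete, here is a small standard-library sketch. It assumes the plain variant (term frequency as count divided by document length, inverse document frequency as the log of documents divided by documents containing the word); the plugins may weight terms differently, and the three "documents" are invented one-liners.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count each alphabetic word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def tf_idf(documents):
    """Return {doc_name: {word: score}} using tf = count/total, idf = ln(N/df)."""
    bags = {name: bag_of_words(text) for name, text in documents.items()}
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags.values():
        doc_freq.update(bag.keys())  # each word counted once per document
    scores = {}
    for name, bag in bags.items():
        total = sum(bag.values())
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        }
    return scores


docs = {
    "a": "patients were randomised to placebo or treatment",
    "b": "patients received treatment for asthma",
    "c": "patients completed the asthma questionnaire",
}
scores = tf_idf(docs)
for word, score in sorted(scores["a"].items(), key=lambda kv: -kv[1])[:3]:
    print(word, round(score, 3))
```

A word that appears in every document (here "patients") scores zero, which is why tf-idf surfaces the vocabulary that distinguishes one trial report from another.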
“Bag of Words”
Three fulltext articles from trialsjournal.com
Facts Marked by “non-scientists” in ContentMine workshops
With Wikipedia everyone can be a scientist
“nuggets” in a scientific paper
quantity
units
Value ranges
Humans aren’t designed to mine this … 
chemical
project places
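Quantities, units and value ranges are exactly the kind of nugget a regex can pick up. The sketch below is illustrative only: the unit list is a short hypothetical sample, and real text needs many more units and number formats.

```python
import re

# Hypothetical, deliberately short unit list; longer alternatives come first.
UNITS = r"(?:mg/kg|mg|g|kg|mL|L|mmol/L|mmHg|°C|%|years?|weeks?|days?|h|min)"
NUMBER = r"\d+(?:\.\d+)?"
QUANTITY = re.compile(
    rf"(?P<low>{NUMBER})"
    rf"(?:\s*(?:-|\u2013|to)\s*(?P<high>{NUMBER}))?"  # optional upper end of a range
    rf"\s*(?P<unit>{UNITS})(?![A-Za-z])"              # the unit must not run on into a word
)


def extract_quantities(text):
    """Return (low, high, unit) tuples; high is None when the value is not a range."""
    results = []
    for match in QUANTITY.finditer(text):
        low = float(match.group("low"))
        high = float(match.group("high")) if match.group("high") else None
        results.append((low, high, match.group("unit")))
    return results


sample = ("Patients aged 18-65 years received 2.5 mg/kg daily for 12 weeks; "
          "temperature was kept at 37 °C.")
for quantity in extract_quantities(sample):
    print(quantity)
```

A hyphen between two numbers is treated as a range here, so year spans or page ranges followed by a unit-like token would need extra checks.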
Advanced Plugins
http://chemicaltagger.ch.cam.ac.uk/
• Typical chemical synthesis
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B Eur.
[Figure: a spectrum plot in a “dumb” PDF, annotated with UNITS, TICKS, QUANTITY, SCALE, TITLES and DATA!! (2000+ points)]
Automatic extraction: Dumb PDF -> CSV -> Semantic Spectrum -> 2nd Derivative, Smoothing (Gaussian Filter)
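Once the points are out of the PDF as numbers, the smoothing and second-derivative steps are ordinary signal processing. The sketch below is a standard-library illustration on a synthetic peak, not AMI's implementation; the function names, the kernel width and the edge handling are assumptions.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Normalised Gaussian weights from -radius to +radius (default radius: 3 sigma)."""
    radius = radius if radius is not None else int(3 * sigma)
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=2.0):
    """Gaussian filter; indices beyond the ends are clamped to the end values."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out


def second_derivative(values, step=1.0):
    """Central finite difference; the two end points are left out."""
    return [(values[i - 1] - 2 * values[i] + values[i + 1]) / (step * step)
            for i in range(1, len(values) - 1)]


# Synthetic spectrum: a single Gaussian peak centred at x = 50.
spectrum = [math.exp(-((x - 50) ** 2) / (2 * 5.0 ** 2)) for x in range(101)]
curvature = second_derivative(smooth(spectrum))
# The most negative curvature marks the peak; +1 undoes the dropped first point.
peak_index = min(range(len(curvature)), key=curvature.__getitem__) + 1
print(peak_index)
```

Smoothing before differentiating matters because finite differences amplify noise in the extracted points.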
AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram, recognizes the paths and generates the
molecules. Then she creates a stop-frame animation showing how the 12
reactions lead into each other.
CLICK HERE FOR ANIMATION
(may be browser dependent)
https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction
Thinning Topology
Serialization
Newick
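The last step, serialization to Newick, is simple once the tree topology is known. A minimal sketch, assuming the tree is already available as nested (name, children) tuples (a hypothetical representation chosen for this example) and ignoring branch lengths:

```python
def to_newick(node):
    """Serialise a nested (name, children) tree to Newick, without the final ';'."""
    name, children = node
    if not children:
        return name
    return "(" + ",".join(to_newick(child) for child in children) + ")" + name


# Unnamed internal nodes use an empty string as their name.
tree = ("", [("A", []), ("B", []), ("", [("C", []), ("D", [])])])
newick = to_newick(tree) + ";"
print(newick)  # (A,B,(C,D));
```

The hard part in practice is the image analysis that produces the topology; the string format itself adds nothing beyond the nesting.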
Phytochemistry extraction
O. dayi
“volatile composition of”
A. sibeiri
A. judaica
Displayed by CAT (CottageLabs)
contentmine.org proposed Services
• Workshops
• Repository indexing
• Funder Compliance
• Publication enhancement
• Extraction of scientific data
contentmine.org team
