Developing a fully automated evidence synthesis tool
for identifying, assessing and collating the evidence
Jon Brassey,1 Christopher Price,1 Jonny Edwards,2 Markus Zlabinger,3 Alexandros Bampoulidis,3,4 Allan Hanbury3
Abstract
Evidence synthesis is a key element of evidence-based medicine. However, it is currently hampered by being labour intensive, meaning that many trials are not incorporated into robust evidence syntheses and that many syntheses are out of date. To overcome this, a variety of techniques are being explored, including automation technology. Here, we describe a fully automated evidence synthesis system for intervention studies: one that identifies all the relevant evidence, assesses the evidence for reliability and collates it to estimate the relative effectiveness of an intervention. Techniques used include machine learning, natural language processing and rule-based systems. Results are visualised using modern visualisation techniques. We believe this to be the first publicly available automated evidence synthesis system: an evidence mapping tool that synthesises evidence on the fly.

1 Trip Database Ltd, Newport, UK
2 Thoughtful Technology, Newcastle, UK
3 Institute of Information Systems Engineering, TU Wien (Vienna University of Technology), Vienna, Austria
4 Research Studio Data Science, RSA FG, Vienna, Austria

Correspondence to: Mr Jon Brassey, Trip Database Ltd, Newport NP20 3PS, UK; jon.brassey@tripdatabase.com

© Author(s) (or their employer(s)) 2019. No commercial re-use. See rights and permissions. Published by BMJ.

To cite: Brassey J, Price C, Edwards J, et al. BMJ Evidence-Based Medicine Epub ahead of print: [please include Day Month Year]. doi:10.1136/bmjebm-2018-111126

Introduction
Evidence-based medicine relies on robust evidence to inform decision-making;1 a main source of that evidence is the systematic review (SR).2 A major problem with SRs is that they are labour intensive and can take up to 2 years to produce.3 In an effort to reduce this workload and the resulting lack of responsiveness for end users, two main areas have started to be explored. First, rapid reviews have explored ways of streamlining the SR process in order to reduce the workload, and this has met with some success.4 The second area is automation, which seeks to replace, or support, current human activity. This paper focuses on the automation side of improving the timeliness of evidence synthesis.

Currently, a number of separate organisations are exploring automation, many of which are part of the International Collaboration for the Automation of Systematic Reviews.5 6 However, we are still many years away from a system that automatically produces SRs.

In an effort to support and foster research in the European Research Area, the European Union created the Horizon 2020 programme. KConnect,7 led by TU Wien, was one successful project, with the aim of supporting the commercialisation of various medical text analysis technologies. A specific subproject was to deliver an automated evidence review system for intervention studies. The idea was to explore how far the current technologies could go to help develop a fully automated form of evidence synthesis.

The subproject was led by the Trip Database,8 an 'evidence-based' clinical search engine. Trip, which started in 1997, is an extensively used search engine covering a large selection of different evidence types, for example, clinical guidelines, SRs and randomised controlled trials (RCTs).

Methods
To explore this challenge, it was important to understand the typical process of an SR and how automation could be used. Cochrane, a prominent SR producer, reports9:

Each systematic review addresses a clearly formulated question; for example: Can antibiotics help in alleviating the symptoms of a sore throat? All the existing primary research on a topic that meets certain criteria is searched for and collated, and then assessed using stringent guidelines, to establish whether or not there is conclusive evidence about a specific treatment.

From the above, for a given clinical question, we can extract the following questions that would need answering in a successful system:
1. Can all the evidence be identified?
2. Can this evidence be assessed?
3. Can it be collated to establish if there is conclusive evidence?
These were the core tasks of this project.

Identify the evidence
For this work, both RCTs and SRs of interventions were used; this reflects the two main sources of evidence for interventions. The Trip Database includes RCTs and SRs within its search index. These are identified from sources such as PubMed and organisational websites.

In order to combine the trials and reviews, it is a prerequisite that we can understand what they are about. To accomplish this, we used the concept of PICO,10 an acronym that highlights what the population, intervention, comparison and outcomes are. For this step, we only required the PIC elements, as the outcomes were not necessary for identifying similar articles.

PIC identification
The aim of PIC identification is to automatically extract the population (eg, men with asthma), the intervention (eg, treated with vitamin C) and its comparison (eg, receiving placebo) from RCTs. To do so, we exploited commonly occurring linguistic patterns. For example, consider the sentence: 'The bioavailability of nasogastric versus trovafloxacin in healthy subjects'. In this sentence, a possible pattern is '[…] of […] vs […] in […]', in which the preposition 'of' indicates the start of the intervention, 'vs' separates the intervention from the comparison and, finally, 'in' indicates the start of the population.

Since the text patterns of RCTs are variable, it was necessary to identify multiple patterns that cover the most common cases. To create ground truth data for this task, we employed six human annotators (two linguists and four people from the medical domain) and asked them to manually label the PIC elements in the title and the abstract of 1750 RCTs. The RCTs were randomly sampled (without date limitation) from the PubMed/MEDLINE medical article database, filtered to English-language RCTs in human medicine. To measure the annotation accuracy of the annotators, we annotated 20 of the 1750 RCTs with the help of a medical expert. These 20 RCTs were labelled by all six annotators, which allowed us to evaluate their annotation accuracy. We measured an average accuracy, between the annotators and the 20 reference RCTs, of 0.70 for the population, 0.66 for the intervention and 0.62 for the comparison.

After manual labelling, we used 80% of the labelled RCTs (the training set) to derive the most frequently occurring patterns for PIC identification. We then created an algorithm that checks whether an input sentence conforms to one of the identified patterns; if so, the PIC information is extracted. An overview of the process is given in figure 1.

Figure 1 The PIC identification algorithm.

Due to the complex text structure of abstracts, we decided to focus on the titles of RCTs. In most cases, the title already contained all of the desired PIC information. In cases where the PIC information was included only in the abstract and not the title, our algorithm did not report any findings, leading to a false negative. Based on a test set (the remaining 20% of the 1750 RCTs), which was not used during the pattern creation process described above, we computed how well our pattern matching algorithm performs. On this test set, we measured a precision/recall of 0.89/0.87 for the population, 0.72/0.88 for the intervention and 0.87/0.87 for the comparison. The high precision indicates that most of the identified PIC elements were correctly identified (ie, a low false positive rate); the high recall shows that the method missed only a few PIC elements (ie, a low false negative rate).
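To make the pattern matching concrete, here is a minimal Python sketch of one such rule. The single regular expression and the helper extract_pic are ours for illustration only; the production system applies many patterns derived from the annotated training set.

import re

# One illustrative PIC rule: "... of <Intervention> vs <Comparison> in <Population>".
# The real system uses many such patterns derived from annotated titles.
PIC_PATTERN = re.compile(
    r"\bof\s+(?P<intervention>.+?)\s+(?:vs\.?|versus)\s+"
    r"(?P<comparison>.+?)\s+in\s+(?P<population>.+)",
    re.IGNORECASE,
)

def extract_pic(title):
    """Return the PIC elements if the title matches this pattern, else None."""
    match = PIC_PATTERN.search(title)
    if match is None:
        return None  # no pattern matched: a false negative if the title does contain PIC
    return {k: v.strip(" .") for k, v in match.groupdict().items()}

print(extract_pic("The bioavailability of nasogastric versus trovafloxacin in healthy subjects."))
# {'intervention': 'nasogastric', 'comparison': 'trovafloxacin', 'population': 'healthy subjects'}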
Assessing the evidence
This involved multiple steps and techniques:
►► Sentiment analysis—this allowed us to understand if the intervention had a positive or negative impact in the population studied.
►► Risk of bias (RoB) assessment—to indicate if a trial was likely to be biased or not.
►► Sample size calculations—understanding how big the trial was, an indication of how reliable the results are likely to be.
Sentiment analysis
Sentiment analysis is a natural language processing task that has typically been applied to tasks such as judging whether a movie review (as text) was positive or negative in relation to the film.11 12 Here, we apply the method to automatically assess whether an article's conclusion is in favour of an intervention or not, based on non-active comparisons (eg, placebo or usual care). For instance, 'For patients with post-ACS depression, active treatment had a substantial beneficial effect on depressive symptoms' might be described as a positive sentiment for the intervention. The model is unaware of the interventions or outcomes of interest (a direction for future work), but we hope nonetheless that attempting to capture general sentiment will provide a reasonable signal.

A model is trained on a corpus of examples labelled as conveying positive or negative sentiment. The training process builds a general model that can subsequently be used to assess the sentiment of a new conclusion. Internally, models exploit a variety of representations; in our case, the final algorithm, chosen after comparing a selection of 12 common approaches, was a gradient boosted tree,13 which uses an ensemble of decision trees to provide a decision.

In our testing, we used a training corpus of 1000 PubMed 'conclusions' (from both RCTs and SRs) and the XGBoost algorithm. We tested a number of established algorithms, and XGBoost gave the best performance in terms of accuracy, with figures of 90.0% for both positive and negative examples assessed on a previously unseen test dataset.
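As a rough illustration of this training setup, the following sketch trains an XGBoost classifier on labelled conclusions. The bag-of-words (TF-IDF) features, the toy corpus and the hyperparameters are our assumptions; the paper specifies only the corpus size and the final choice of algorithm.

# Minimal sketch: sentiment classification of article conclusions with XGBoost.
# The feature representation and hyperparameters are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

# Toy stand-ins for the 1000 labelled PubMed conclusions (1 = positive, 0 = negative).
conclusions = [
    "Active treatment had a substantial beneficial effect on depressive symptoms.",
    "The intervention significantly improved outcomes compared with placebo.",
    "The intervention showed no benefit over placebo.",
    "No significant improvement was observed in the treatment group.",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(conclusions)

model = XGBClassifier(n_estimators=100, max_depth=3)
model.fit(X, labels)

new = vectorizer.transform(["Symptoms improved substantially with active treatment."])
print(model.predict(new))  # eg, [1], ie, positive sentiment for the intervention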
Risk of bias
The assessment of bias for RCTs is based on RobotReviewer's RoB estimates, which in turn are based on the words and short phrases used in the titles and abstracts of papers.14 RobotReviewer has 'learnt' how to assess various common biases by examining the titles and abstracts of tens of thousands of articles describing clinical trials included in the Cochrane Database of Systematic Reviews. These trials have all been manually assessed for bias by SR authors using the Cochrane RoB tool. These annotations were used to train machine learning models, which are used to predict a risk of bias score. The output is either 'low RoB' or 'high/unclear RoB'. RoB assessment may be automated with reasonable accuracy using the abstracts alone.

For SRs, the assessment of bias has not been possible; we have therefore allocated Cochrane Systematic Reviews to 'low RoB' and all other SRs to 'high/unclear RoB'. We note that this is an approximation, and also that the system would assign a 'low' RoB to a well-conducted Cochrane review that included only poor-quality, small studies. However, we would expect such a review to have negative sentiment, and the 'bias' label is an indicator of the reliability of that conclusion.
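Downstream, both study types reduce to the same two-way label. A minimal sketch of this allocation rule (our illustration; RobotReviewer's per-trial prediction is abstracted here to a pre-binarised string):

def rob_for_sr(is_cochrane):
    """SRs cannot yet be assessed automatically: Cochrane reviews are
    allocated 'low' RoB, all other SRs 'high/unclear'."""
    return "low" if is_cochrane else "high/unclear"

def rob_for_rct(robotreviewer_label):
    """RCTs take RobotReviewer's abstract-based prediction, already
    binarised to 'low' or 'high/unclear'."""
    assert robotreviewer_label in ("low", "high/unclear")
    return robotreviewer_label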
Sample size
The sample size algorithm extracts, from an RCT abstract, the number of participants in the RCT. It was developed in a similar fashion to the PIC identification, in other words by building up a series of rules. We identified the sample size in 241 randomly chosen RCT abstracts (taken from Trip's corpus of RCTs) and derived patterns from them. As an example, the most frequently appearing pattern is the first occurrence of a number followed by a specific keyword, as in 'Thirty-five patients undergoing total hip replacement due to primary arthrosis were randomised into two groups'. In this example, the first number, 'Thirty-five', is followed by the term 'patients'. Based on such patterns, we defined rules that are checked against the input RCT abstract; the sample size is extracted if one of the rules matches the input. The accuracy of the algorithm in extracting the sample size is 80%.
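A minimal sketch of the most frequent rule (a number directly followed by a keyword) is shown below. The keyword list and the handling of written-out numbers are our simplifications; the actual rule set derived from the 241 abstracts is richer.

import re

# Illustrative mapping of written-out numbers; a real rule set would cover far
# more vocabulary (hundreds, thousands, further compound forms, etc.).
WORD_NUMBERS = {
    "five": 5, "ten": 10, "twenty": 20, "thirty": 30,
    "forty": 40, "fifty": 50,
}

def to_int(token):
    token = token.lower()
    if token.isdigit():
        return int(token)
    parts = token.split("-")  # handles forms such as "thirty-five"
    if all(p in WORD_NUMBERS for p in parts):
        return sum(WORD_NUMBERS[p] for p in parts)
    return None

# Rule: first number (digits or words) directly followed by a keyword.
SAMPLE_SIZE_RULE = re.compile(
    r"\b([A-Za-z]+(?:-[A-Za-z]+)?|\d+)\s+(?:patients|participants|subjects)\b",
    re.IGNORECASE,
)

def extract_sample_size(abstract):
    for match in SAMPLE_SIZE_RULE.finditer(abstract):
        n = to_int(match.group(1))
        if n is not None:
            return n  # first matching number wins
    return None

print(extract_sample_size(
    "Thirty-five patients undergoing total hip replacement due to "
    "primary arthrosis were randomised into two groups."
))  # 35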
Collating the evidence
Once all the RCTs and SRs in Trip had been assessed for sample size, RoB and sentiment, the requirement was to combine these to estimate the potential effectiveness of the intervention. The assessments are used to create the variables that drive the visualised results. All trials and reviews with the same P and I are placed together in an evidence group, and the following variables are produced (a worked sketch of the calculations follows the list):
►► Group sample size—this is obtained by adding all the sample sizes of the RCTs in the evidence group. SRs are not assigned a sample size; they are used to verify, or not, the results obtained from the RCTs.
►► Group RoB—the sum of the sample sizes of the high/unknown RoB RCTs divided by the sum of the sample sizes of all trials. For instance, if there were three trials in the evidence group, two at high/unknown RoB with sample sizes of 250 each and a third at low RoB with a sample size of 1000, then the group RoB would be 500÷1500=0.33 (500 of 1500 patients in the high/unknown RoB category).
►► Overall effectiveness—each trial is given a score of either +1 (representing a positive sentiment) or −1 (negative sentiment). For each trial, this score is reduced if the trial is at high/unknown RoB and/or if the sample size is small; the smaller the trial, the greater the reduction, reflecting the greater uncertainty associated with small trials. These scores are then weighted by sample size to give an overall score. For instance, if one trial had a sample size of 900 and an overall score of 1 (representing a large trial at low RoB) and another trial had a sample size of 100 and an overall score of 0.25 (representing a small trial at high/unknown RoB), the overall estimate of effectiveness would be 0.925.
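The sketch below reproduces both worked examples. The group RoB formula is as stated; for overall effectiveness we assume, consistent with the 0.925 example, a sample-size-weighted mean of the per-trial scores (the paper does not specify how the per-trial reductions for small or biased trials are computed, so the scores are taken as given).

# Sketch of the collation step, reproducing the worked examples above.
from dataclasses import dataclass

@dataclass
class Trial:
    sample_size: int
    high_or_unknown_rob: bool
    score: float  # per-trial score, already reduced for RoB/sample size

def group_rob(trials):
    """Share of patients enrolled in high/unknown-RoB trials."""
    total = sum(t.sample_size for t in trials)
    risky = sum(t.sample_size for t in trials if t.high_or_unknown_rob)
    return risky / total

def overall_effectiveness(trials):
    """Sample-size-weighted mean of per-trial scores (our reading of the
    paper's worked example)."""
    total = sum(t.sample_size for t in trials)
    return sum(t.sample_size * t.score for t in trials) / total

# Worked example 1: two 250-patient trials at high/unknown RoB, one 1000-patient trial at low RoB.
print(round(group_rob([Trial(250, True, 1.0), Trial(250, True, 1.0), Trial(1000, False, 1.0)]), 2))  # 0.33

# Worked example 2: scores 1.0 (n=900, low RoB) and 0.25 (n=100, high/unknown RoB).
print(overall_effectiveness([Trial(900, False, 1.0), Trial(100, True, 0.25)]))  # 0.925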
Visualisation
Inspired by Information is Beautiful's 'Snake Oil' visualisation,15 the results are displayed with the overall effectiveness on the y-axis. Each 'blob' corresponds to an evidence group, where the size of the blob represents the sample size and the shading reflects the overall RoB (see figure 2).

Figure 2 Initial visualisation. In this example, for acne, each blob represents a single intervention. Blob sizes are proportionate to sample size. The higher up the chart, the greater the likelihood of being effective.

Users can interact with the visualisation, for example by showing only articles with low RoB or only those above a certain sample size. Blobs can be arranged along the x-axis alphabetically, or by RoB, number of articles or overall sample size.

A second-level visualisation appears when a user clicks on an individual blob. It charts each trial or SR along two axes, RoB and sentiment (see figure 3). Again, blob size represents sample size. This allows users to easily see the distribution of the evidence base.

Figure 3 Second-level visualisation. Here, each RCT (green) or systematic review (purple) is charted using two variables, sentiment and risk of bias. Again, blob size is linked to sample size.
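The live system is an interactive web visualisation; the static matplotlib sketch below only illustrates the first-level encoding (y-axis = overall effectiveness, blob area proportional to group sample size, shading = group RoB). The interventions and values are invented for the example.

import matplotlib.pyplot as plt

# Hypothetical evidence groups: (intervention, overall effectiveness,
# group sample size, group RoB in [0, 1]). All values are invented.
groups = [
    ("benzoyl peroxide", 0.8, 3200, 0.2),
    ("topical retinoid", 0.6, 1500, 0.5),
    ("oral antibiotic", 0.3, 900, 0.7),
    ("herbal remedy", -0.4, 150, 0.9),
]

names = [g[0] for g in groups]
effectiveness = [g[1] for g in groups]
areas = [g[2] / 5 for g in groups]   # blob area scales with group sample size
shades = [g[3] for g in groups]      # darker blob = more patients at high/unknown RoB

fig, ax = plt.subplots()
blobs = ax.scatter(range(len(groups)), effectiveness, s=areas,
                   c=shades, cmap="Greys", vmin=0, vmax=1, edgecolors="black")
ax.set_xticks(range(len(groups)))
ax.set_xticklabels(names, rotation=30, ha="right")
ax.set_ylabel("Overall effectiveness")
fig.colorbar(blobs, label="Group risk of bias")
plt.tight_layout()
plt.show()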
Discussion
To the best of our knowledge, this is the first publicly available system to automatically undertake evidence synthesis. As well as machine learning, it uses modern data visualisation techniques to help users rapidly understand the evidence base for a given condition. Because the system is fully automatic, all RCTs and SRs are included in the syntheses, meaning that all conditions and interventions are covered; and, given the nature of the system, new RCTs and SRs are added automatically as they are published, ensuring the syntheses are always up to date.

The method described is much akin to vote counting,16 whereby the numbers of positive and negative studies are counted. However, the technique we used overcomes one major criticism of vote counting16 17 in that we take into account the sample size of the trials and adjust their impact accordingly. In other words, a small trial is not counted as equal to a large trial. Similarly, with our ability to assess the RoB for a given RCT or SR, we are able to ensure that trials with higher risks of bias carry less weight than unbiased trials. The emphasis on larger trials is also supported by the knowledge that the largest trial often gives similar results to a subsequent meta-analysis.18

As the project has developed, it has become clear that these syntheses are not the same as SRs in the traditional sense and resemble more a version of evidence maps.19 Evidence maps are a relatively new concept but are, broadly, attempts to visualise the available evidence for a given topic. Our system certainly visualises the available evidence, but in addition, it combines the evidence to give an indication of
the likely effectiveness of a given intervention, allowing relative ranking to be achieved. So, our system could be seen as a hybrid: a system that both maps the available evidence and seeks to assess, and rank, the relative effectiveness of the interventions for a given condition. Such a method has many potential uses; for instance, it can give an immediate overview of the interventions for a given condition, allowing a user to understand the potential effectiveness of a particular intervention and compare it with the others.

This autosynthesis project is very much in development and is released as a 'proof of concept'. We aim to incorporate RCTs that exist outside of PubMed (currently our only source), and we are starting to work on methods to incorporate outcomes and adverse effects data and to use statistical techniques to infer likely effect sizes. Errors in information extraction can lead to papers being incorrectly classified; this can be improved by obtaining more annotated text to better train the algorithms. However, the interactive visualisation allows such errors to be found relatively easily. Most importantly, we will seek to explore the validity of our approach, and we are actively seeking funding to achieve this.

Another avenue we would like to explore is incorporating knowledge of unpublished trials within the algorithms and visualisations. Up to 50% of all clinical trials are unreported,20 and this can have profound effects on the estimates of effectiveness in meta-analyses.21 22 Yet, we know that traditional approaches to SRs are poor at using unpublished trials.23 Using our approach, we could relatively simply add these data.

Our project is novel and there is uncertainty as to how it will be received. We are mindful of Moher et al's paper24 that describes different evidence review methods as being part of the same family, with new methods appearing over time. Where this system fits, if at all, will be decided by others, not these authors.

Acknowledgements Iain Marshall and Byron Wallace, both from RobotReviewer.

Contributors JB and CP conceived and designed the work. MZ developed the algorithms for PIC extraction and conducted the evaluation. AB developed the sample size methods. AH led the overall grant, helped develop the overview methods for the work and oversaw the work of MZ and AB. JE developed the sentiment analysis system. All authors contributed to the drafting and completion of the paper.

Funding European Commission, Horizon 2020 Framework Programme, grant No 644753 (KConnect).

Competing interests JB and CP are shareholders in Trip Database. JB is also a member of the editorial board of BMJ EBM. No other competing interests declared by the other authors.

Patient consent for publication Not required.

Provenance and peer review Not commissioned; externally peer reviewed.
References
1 Sackett DL, Rosenberg WM, Gray JA, et al. Evidence based medicine: what it is and what it isn't. BMJ 1996;312:71–2.
2 Bastian H, Glasziou P, Chalmers I. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS Med 2010;7:e1000326.
3 The Cochrane Oversight Committee. Measuring the performance of the Cochrane Library, 2012. Available: https://www.cochranelibrary.com/cdsr/doi/
4 Tricco AC, Antony J, Zarin W, et al. A scoping review of rapid review methods. BMC Med 2015;13:224.
5 O'Connor AM, Tsafnat G, Gilbert SB, et al. Still moving toward automation of the systematic review process: a summary of discussions at the third meeting of the International Collaboration for Automation of Systematic Reviews (ICASR). Syst Rev 2019;8:57.
6 Beller E, Clark J, Tsafnat G, et al. Making progress with the automation of systematic reviews: principles of the International Collaboration for the Automation of Systematic Reviews (ICASR). Syst Rev 2018;7:77.
7 KConnect. Improve the flow of medical information for better decisions and better care. Available: http://www.kconnect.eu/
8 Trip Database. Available: https://www.tripdatabase.com/
9 Chapman S. What are Cochrane reviews? Evidently Cochrane 2014.
10 CEBM. Asking focused questions, 2014. Available: http://www.cebm.net/blog/2014/06/10/asking-focused-questions/
11 Pang B, Lee L. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2008;2:1–135.
12 Pang B, Lee L. A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL '04), Article No. 271.
13 Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). New York, NY, USA: ACM, 2016:785–94.
14 Marshall IJ, Kuiper J, Wallace BC. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. J Am Med Inform Assoc 2016;23:193–201.
15 Information is Beautiful. Snake oil: scientific evidence for nutritional supplements. Available: https://informationisbeautiful.net/visualizations/snake-oil-scientific-evidence-for-nutritional-supplements-vizsweet/
16 Cooper H, Hedges LV, Valentine JC. The handbook of research synthesis and meta-analysis. New York: Russell Sage Foundation, 2009.
17 Harrison F. Getting started with meta-analysis. Methods Ecol Evol 2011;2:1–10.
18 Glasziou PP, Shepperd S, Brassey J. Can we rely on the best trial? A comparison of individual trials and systematic reviews. BMC Med Res Methodol 2010;10:23.
19 Miake-Lye IM, Hempel S, Shanman R, et al. What is an evidence map? A systematic review of published evidence maps and their definitions, methods, and products. Syst Rev 2016;5:28.
20 AllTrials. Half of all trials unreported. Available: http://www.alltrials.net/news/half-of-all-trials-unreported/
21 Turner EH, Matthews AM, Linardatos E, et al. Selective publication of antidepressant trials and its influence on apparent efficacy. N Engl J Med 2008;358:252–60.
22 Hart B, Lundh A, Bero L. Effect of reporting bias on meta-analyses of drug trials: reanalysis of meta-analyses. BMJ 2012;344:d7202.
23 Schroll JB, Bero L, Gøtzsche PC. Searching for unpublished data for Cochrane reviews: cross sectional study. BMJ 2013;346:f2231.
24 Moher D, Stewart L, Shekelle P. All in the family: systematic reviews, rapid reviews, scoping reviews, realist reviews, and more. Syst Rev 2015;4:183.