1 Introduction

In the pursuit of understanding natural phenomena, scientists often formulate plausible theories and design experiments to test competing hypotheses. This process typically involves careful experimental design and abductive reasoning. Experimentation can be optimised through active learning (Mitchell, 1982; Cohn et al., 1994), where informative training examples are strategically selected to combat resource limitations. Inductive Logic Programming (ILP) (Muggleton, 1991) offers an automated approach to abduction, representing observations, hypotheses and background knowledge as interpretable logic programs. In ILP, hypotheses are learned to explain observational data with respect to the background knowledge. ILP can propose “askable” hypotheses, known as abducibles, that might explain contradictions between observations and prior background knowledge (Bryant et al., 2001; Muggleton & Bryant, 2000). The extended knowledge, enriched with abduced hypotheses, is subsequently verified against observations.

Our work explored the applicability of abductive reasoning and active learning to identifying the functions of metabolic genes in model bacterial organisms. Given that biological relationships are commonly described logically, ILP is particularly adept at operating on biological knowledge bases. The integration of abductive reasoning and active learning via ILP was successfully demonstrated in the context of biological discovery by the Robot Scientist (King et al., 2004). The Robot Scientist performed active learning by strategically selecting key experiments to achieve more data- and cost-effective gene function learning than random experimentation. However, this demonstration was limited to only 17 genes in the aromatic amino acid pathway of yeast. Our work significantly advanced this integrative learning paradigm towards genome-scale biological discovery by examining genome-scale metabolic models (GEMs). We examined the GEM iML1515 (Monk et al., 2017), which encompasses 1515 genes and 2719 metabolic reactions of the Escherichia coli (E. coli) strain K-12 MG1655, a versatile organism for metabolic engineering to produce specific compounds.

Fig. 1

\(BMLP_{active}\) encodes the GEM iML1515 as Boolean matrices and predicts the auxotrophic mutant phenotypes. It actively consults a data source to request ground truth labels, which minimises the expected experimental cost based on a user-provided cost function. The underlying BMLP iteratively refutes gene function annotation hypotheses inconsistent with labelled training examples. While \(BMLP_{active}\) to date only learns from synthetic data, a laboratory robot can be integrated to perform selected experiments

Despite the extensive mapping of the E. coli genome to metabolic functions in iML1515, inaccurate phenotype predictions have been identified due to erroneous gene function annotations (Bernstein et al., 2023). Learning gene function annotations in GEMs is computationally and empirically challenging. We built \(BMLP_{active}\) (Fig. 1), the first logic programming system operating on GEMs, based on our novel Boolean Matrix Logic Programming (BMLP) approach. \(BMLP_{active}\) uses Boolean matrices to encode the biochemical and genetic relationships in iML1515 and enables high-throughput logical inferences to predict auxotrophic mutant phenotypes. \(BMLP_{active}\) implements active learning to select auxotrophic mutant experiments that minimise the expected experimental cost via a user-defined cost function. \(BMLP_{active}\) successfully learned gene function annotations through active learning while reducing the optional nutrient cost by 90% compared with randomly selected experiments. It additionally reduced the amount of experimental data needed compared to random experiment selection. When operating within a finite budget, our approach could deliver optimal experimental outcomes, whereas studies employing random experiment selection might not reach completion.

Furthermore, we applied \(BMLP_{active}\) to learning digenic function annotations, which holds significant implications for drug development and therapeutic interventions (Costanzo et al., 2016). Digenic functions are related to isoenzymes, two enzymes responsible for the same reaction. Digenic interactions between complex genotypes and phenotypes remain largely unexplored in most organisms (Costanzo et al., 2019), and their dynamics depend on the growth conditions (Bernstein et al., 2023). A comprehensive understanding of digenic interactions would expedite strain engineering efforts. We show that \(BMLP_{active}\) converged to the correct gene-isoenzyme mapping with as few as 20 training examples. This represents significant promise for \(BMLP_{active}\) to address complex genetic interactions at the whole-genome level.

2 Representing a GEM using datalog

We refer readers to (Lloyd, 2012; Nienhuys-Cheng & Wolf, 1997) for terminology on datalog programs. A recursive program has a body literal that appears in the head of a clause. Specifically, we focused on linear and immediately recursive datalog programs, a simple class of recursive datalog programs. A linear and immediately recursive datalog program has a single recursive clause where the recursive predicate appears in the body only once (Ioannidis, 1986). This subset of recursive datalog programs can represent the relational structure of reaction pathways:

$$\begin{aligned}&pathway(X,Y) \leftarrow reaction(X,Y). \\&pathway(X,Y) \leftarrow reaction(X,Z), pathway(Z,Y). \end{aligned}$$
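As an illustration, the least fixpoint of this program can be computed by naive bottom-up evaluation. The sketch below is in Python rather than datalog, and uses a hypothetical set of reaction/2 facts over metabolites m1..m4:

```python
# Naive bottom-up evaluation of the recursive pathway/2 program.
# The reaction/2 facts below are hypothetical.
reaction = {("m1", "m2"), ("m2", "m3"), ("m3", "m4")}

def evaluate_pathway(reaction):
    """Apply both clauses until no new pathway/2 facts are derived."""
    pathway = set(reaction)                      # base clause
    while True:
        # recursive clause: reaction(X,Z), pathway(Z,Y) -> pathway(X,Y)
        derived = {(x, y) for (x, z) in reaction
                          for (z2, y) in pathway if z == z2}
        if derived <= pathway:                   # fixpoint reached
            return pathway
        pathway |= derived

print(sorted(evaluate_pathway(reaction)))
```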

Petri nets are a class of directed bipartite graphs (Reisig, 2013) and a commonly used graphical tool for modelling processes in biological systems (Sahu et al., 2021). We refer readers to Reisig (2013) and Rozenberg and Engelfriet (1998) for background on Petri nets. Petri nets contain nodes marked by tokens (black dots), which indicate the availability of resources such as chemical metabolites. The presence of reactants or products determines which reaction pathways are viable and how the resources are allocated. We examined a class of Petri nets called the one-bounded elementary net (OEN) (Reisig, 2013), which allocates at most one token per node (definitions in Appendix A). We use an OEN as a conceptual model of essential metabolites and pathways in a metabolic network. We provide an example of a simple metabolic network encoded as a linear and immediately recursive datalog program \(\mathcal {P}_1\).

figure a
figure b

Each ground fact in the program describes a reaction between a set of reactants and a set of products. Evaluating the recursive program pathway(X,Y) is equivalent to finding the union of token distributions that respect the viable reactions (see Appendix B).

3 Boolean matrix logic programming

We propose the Boolean Matrix Logic Programming (BMLP) approach, which uses Boolean matrices to evaluate datalog programs in contrast to the traditional symbolic logic program evaluation.

Definition 1

(Boolean Matrix Logic Programming (BMLP) problem) Let \(\mathcal {P}\) be a datalog program containing a set of clauses with predicate symbol r. The goal of Boolean Matrix Logic Programming (BMLP) is to find a Boolean matrix \({\textbf {R}}\) encoded by a datalog program such that \(({\textbf {R}})_{i,j}\) = 1 if \(\mathcal {P} \models r(c_i, c_j)\) for constant symbols \(c_i, c_j\) and \(({\textbf {R}})_{i,j}\) = 0 otherwise.
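For the pathway program above, the matrix \({\textbf {R}}\) of Definition 1 is the transitive closure of the Boolean adjacency matrix of the reaction/2 facts. A minimal Python sketch, with hypothetical constants \(c_0, \ldots, c_3\):

```python
# R[i][j] = True iff P |= pathway(c_i, c_j): the transitive closure
# of the Boolean reaction matrix M. Constants c0..c3 are hypothetical.
def transitive_closure(M):
    n = len(M)
    R = [row[:] for row in M]        # start from the base facts
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for k in range(n):
                if R[i][k]:
                    for j in range(n):
                        # pathway(ci,ck), reaction(ck,cj) -> pathway(ci,cj)
                        if M[k][j] and not R[i][j]:
                            R[i][j] = True
                            changed = True
    return R

# reaction(c0,c1), reaction(c1,c2), reaction(c2,c3)
M = [[False] * 4 for _ in range(4)]
for i, j in [(0, 1), (1, 2), (2, 3)]:
    M[i][j] = True
R = transitive_closure(M)
assert R[0][3]                       # P |= pathway(c0, c3)
```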

A linear and immediately recursive datalog program \(\mathcal {P}\) can be written as matrices for bottom-up evaluation (Ceri et al., 1989) and evaluated by linear equations (Sato, 2017). We created a BMLP algorithm called iterative extension (BMLP-IE) (Fig. 2) to evaluate the datalog encoding of the metabolic network. BMLP-IE uses a combination of binary operations: Boolean matrix addition (ADD), multiplication (MUL) and equality (EQ) (Copilowish, 1948). This algorithm uses \(O(n^2)\) binary operations given n total metabolites in the metabolic network (Appendix D).

Fig. 2

Iterative extension. a The vector \({\textbf {v}}\) encodes source chemical metabolites. All reactions are represented in the Boolean matrices \({\textbf {R}}_1\) and \({\textbf {R}}_2\). b BMLP-IE computes \({\textbf {v}}^*\), the complete set of producible metabolites
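The iteration of Fig. 2 can be sketched as follows. This is a conceptual Python sketch only: reactions are kept as (reactants, products) pairs and the metabolite names are hypothetical, rather than the paper's Boolean-matrix encoding via \({\textbf {R}}_1\) and \({\textbf {R}}_2\):

```python
# Conceptual sketch of iterative extension: starting from the source
# metabolites v, repeatedly fire every reaction whose reactants are all
# available, until the producible set v* stops growing.
# Hypothetical reactions: glc -> g6p -> f6p, and f6p + atp -> fbp.
reactions = [({"glc"}, {"g6p"}),
             ({"g6p"}, {"f6p"}),
             ({"f6p", "atp"}, {"fbp"})]

def iterative_extension(sources, reactions):
    v = set(sources)
    changed = True
    while changed:                        # fixpoint: v has stopped growing
        changed = False
        for reactants, products in reactions:
            if reactants <= v and not products <= v:
                v |= products             # reaction fires, products produced
                changed = True
    return v

v_star = iterative_extension({"glc", "atp"}, reactions)
# fbp is producible only because both f6p and atp are available
```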

4 Active learning

4.1 Hypothesis compression

We use the compression score of a hypothesis h to optimise the posterior probability. Based on the encoding of h, a learner can derive the relative frequency p(h) with which h is chosen. The compactness of h’s encoding reflects its prior probability \(p(h) = 2^{-size(h)}\) (Shannon & Weaver, 1963), so the compression function can be used to compute the prior p(h). Given a set of training examples E, the compression of a hypothesis (Muggleton, 1995; Bryant et al., 2001) can be interpreted as

$$\begin{aligned} compression(h,E) = size(E) - MDL_{h,E} \end{aligned}$$

In addition, the compression relates to the posterior probability of h (Muggleton, 1995):

$$\begin{aligned} \frac{p(h|E)}{p(\emptyset |E)} = 2^{compression(h, E)} \end{aligned}$$

We can rearrange this to obtain the normalised posterior probability (Bryant et al., 2001):

$$\begin{aligned} p'(h|E) = \frac{2^{compression(h, E)}}{\sum _{h_i\in H} 2^{compression(h_i, E)}} \end{aligned}$$

The posterior probability is maximal when compression is maximal. By searching for a hypothesis with maximum posterior probability, a learner can maximise its expected predictive accuracy (Muggleton, 1995). Hence, to learn an accurate hypothesis, the goal is to find the hypothesis with the highest compression score.
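The normalised posterior above can be computed directly from compression scores; the scores below are illustrative values only:

```python
# Normalised posterior p'(h|E) from compression scores, following the
# formula above. The compression values are illustrative.
compressions = {"h1": 10, "h2": 8, "h3": 3}

def posterior(compressions):
    weights = {h: 2.0 ** c for h, c in compressions.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

p = posterior(compressions)
best = max(p, key=p.get)   # maximal compression gives maximal posterior
```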

The compression function in \(BMLP_{active}\) follows the Minimal Description Length (MDL) principle (Conklin & Witten, 1994). Since we have a fixed set of hypotheses without an explicit prior distribution, MDL allows us to make minimal assumptions about their prior. The most probable hypothesis h for the training examples E should be compact and have the fewest disagreements between E and predictions made by h. This minimises \(L(h) + L(E|h)\), where L(h) is the descriptive complexity of h and L(E|h) is the descriptive complexity of encoding E using h. We consider a compression function from Bryant et al. (2001) based on the MDL principle:

$$\begin{aligned} compression(h,E) = |E^+| - \frac{|E^+|}{pc_h} (size(h) + fp_h) \end{aligned}$$

where \(E^+\) is the set of positive examples seen by the active learner, \(pc_h\) is the number of positive predictions by h, and \(fp_h\) is the number of false positives covered by h. In other words, this compression function favours a general hypothesis with high coverage and penalises over-general hypotheses that incorrectly predict negative examples. The generality of h can be estimated by computing its coverage \(pc_h\) for a set of unlabelled instances (Muggleton, 1995). We look at all available experiment instances to obtain a good estimation of the generality of hypotheses. Our BMLP approach can facilitate efficient inference for this estimation.
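This compression function can be transcribed directly; the counts in the example are illustrative only:

```python
def compression(n_pos, pc_h, fp_h, size_h):
    """MDL-based compression from the formula above.
    n_pos  = |E+|, positive examples seen by the learner
    pc_h   = number of positive predictions by h
    fp_h   = number of false positives covered by h
    size_h = size(h)"""
    return n_pos - (n_pos / pc_h) * (size_h + fp_h)

# illustrative: 10 positives seen, h makes 20 positive predictions,
# covers 2 false positives, and has size 3
assert compression(10, 20, 2, 3) == 7.5
```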

4.2 Expected cost of experiments

Experiment planning is often restricted by time and other resources. A rule of thumb for experiment design is to achieve optimal empirical outcomes within a finite budget. The problem of designing experiments to support or refute scientific hypotheses involves a series of binary decisions. This process is analogous to playing the game of “Guess Who”, where players ask a sequence of “yes” or “no” questions whose answers partition the candidate hypotheses into two sets. The outcome of an experiment t splits hypotheses H into consistent hypotheses \(H_{t}\) and inconsistent hypotheses \(\overline{H_{t}}\). The path of selection and outcomes of experiments can be represented as a binary decision tree. To construct an optimal tree, an active learner should seek to label instances consistent with up to half of the hypotheses (Mitchell, 1982).

In addition to this principle, we consider a user-defined experiment cost function \(C_t\) of experiment t. The experiment selection is represented by a binary tree with paths annotated by the costs of experiments. The optimal selection is therefore a binary tree with the minimum expected overall cost. The minimum expected cost of performing a set of candidate experiments T can be recursively defined as:

$$\begin{aligned} EC(\emptyset ,T) = 0&\\ EC(\{h\},T) = 0&\\ EC(H,T) = min_{t \in T} \, [C_t + p(t) EC (H_{t}, T - \{t\}) + (1 - p(t)) EC (\overline{H_{t}}, T - \{t\})] \end{aligned}$$

To estimate this recurrent cost function, \(BMLP_{active}\) uses the following heuristic function from Bryant et al. (2001) to approximate optimal cost selections:

$$\begin{aligned} EC(H,T) \approx min_{t \in T} \, [C_t&+ p(t) (mean_{t' \in T - \{t\}} C_{t'}) J_{H_{t}} \nonumber \\&+ (1 - p(t)) (mean_{t' \in T - \{t\}} C_{t'}) J_{\overline{H_{t}}}] \end{aligned}$$

where \(H_{t}\) and \(\overline{H_{t}}\) are subsets of hypotheses H consistent and inconsistent with t’s label. p(t) is the probability that the outcome of the experiment t is positive and \(J_{H} = - \sum _{h \in H} p'(h|E) \, log_2 (p'(h|E))\). The probability \(p'(h|E)\) is calculated from the compression function. To estimate p(t), we compute the sum of the probabilities of the hypotheses consistent with a positive outcome of t. Users can flexibly define the experiment cost function \(C_t\).
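The heuristic can be sketched as a one-step lookahead over candidate experiments. In the sketch below the hypothesis posteriors, consistency sets and costs are hypothetical:

```python
import math

def entropy(post, hyps):
    """J_H = -sum over h in H of p'(h|E) log2 p'(h|E)."""
    return -sum(post[h] * math.log2(post[h]) for h in hyps if post[h] > 0)

def select_experiment(posterior, consistent, cost):
    """Pick the experiment minimising the heuristic expected cost.
    posterior:  normalised p'(h|E) per hypothesis
    consistent: for each t, hypotheses consistent with a positive outcome
    cost:       user-defined C_t per experiment"""
    best_t, best_ec = None, float("inf")
    ts = sorted(cost)
    for t in ts:
        others = [cost[u] for u in ts if u != t]
        mean_c = sum(others) / len(others) if others else 0.0
        H_t = consistent[t]                             # consistent with positive label
        H_not = [h for h in posterior if h not in H_t]  # inconsistent
        p_t = sum(posterior[h] for h in H_t)            # estimate of p(t)
        ec = (cost[t]
              + p_t * mean_c * entropy(posterior, H_t)
              + (1 - p_t) * mean_c * entropy(posterior, H_not))
        if ec < best_ec:
            best_t, best_ec = t, ec
    return best_t

# t1 splits the hypotheses in half, t2 discriminates nothing: t1 wins
posterior = {"h1": 0.5, "h2": 0.5}
consistent = {"t1": ["h1"], "t2": ["h1", "h2"]}
cost = {"t1": 1.0, "t2": 1.0}
assert select_experiment(posterior, consistent, cost) == "t1"
```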

4.3 Active learning sample complexity

For some hypothesis space H and background knowledge BK, let \(V_s\) denote the version space of hypotheses consistent with s training examples. For an active version space learner that selects one instance per iteration, \(|V_s|\) denotes the size of the version space at iteration s of active learning. The shrinkage of the hypothesis space can be represented by the reduction ratio \(\frac{|V_{s+1}|}{|V_s|}\) after querying the \(s+1\) label (Hocquette & Muggleton, 2018). The minimal reduction ratio \(p(x_{s+1}, V_s)\) is the minority ratio of the version space \(V_s\) partitioned by an instance \(x_{s+1}\).

Definition 2

(Minimal reduction ratio Hocquette and Muggleton (2018)) The minimal reduction ratio over the version space \(V_s\) by sampled instance \(x_{s+1}\) is

$$\begin{aligned} p(x_{s+1}, V_s) = \frac{min(\, |\{h \in V_s | \,h \cup BK \models x_{s+1}\}|, |\{h \in V_s | \,h \cup BK \not \models x_{s+1}\}|)}{|V_s|} \end{aligned}$$
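Definition 2 translates directly into code. In the sketch below, a predicate `covers(h, x)` stands in for \(h \cup BK \models x\), and the hypotheses are hypothetical:

```python
# Minimal reduction ratio of Definition 2: the minority fraction of the
# version space partitioned by an instance x.
def minimal_reduction_ratio(version_space, x, covers):
    pos = sum(1 for h in version_space if covers(h, x))   # h u BK |= x
    neg = len(version_space) - pos                        # h u BK |/= x
    return min(pos, neg) / len(version_space)

# four hypothetical hypotheses; x is entailed by two of them -> ratio 1/2
V = ["h1", "h2", "h3", "h4"]
covers = lambda h, x: h in {"h1", "h2"}
assert minimal_reduction_ratio(V, "x", covers) == 0.5
```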

The minimal reduction ratio of an actively selected instance can be computed before it is labelled. While in reality training examples might not be as discriminative, the optimal selection strategy is to select instances with minimal reduction ratios as close as possible to \(\frac{1}{2}\), which can eliminate up to 50% of the hypothesis space. We describe the sample complexity advantage of active learning over random example selection by the following bound on an active version space learner’s sample complexity. A passive learner uses random example selection and has no control over the training examples it uses. Theorem 1 (proof in Appendix E) states that the number of instances an active learner needs to select should be some factor smaller than the number of randomly sampled examples required to reach a given predictive accuracy level.

Theorem 1

(Active learning sample complexity bound) For some \(\phi \in [0, \frac{1}{2}]\) and small hypothesis error \(\epsilon > 0\), if an active version space learner can select instances to label from an instance space \(\mathcal {X}\) with minimal reduction ratios greater than or equal to \(\phi \), the sample complexity \(s_{active}\) of the active learner is

$$\begin{aligned} s_{active} \le \frac{\epsilon }{\epsilon + \phi } s_{passive} + c \end{aligned}$$
(1)

where c is a constant and \(s_{passive}\) is the sample complexity of learning from randomly labelled instances.

Given a desirable predictive error \(\epsilon \), the factor in Theorem 1 can be estimated from the minimal reduction ratio \(\phi \) over the instance space. When selected instances have a larger minimal reduction ratio, the sample complexity gain would be more significant. On the other hand, if instances have low discriminative power and the minimal reduction ratio tends to zero, the sample complexity gain would be minimal. In Sect. 6, we empirically show evidence supporting this theorem for \(BMLP_{active}\) and randomly sampled experiments when learning gene function annotations in a GEM.

5 Implementation

Algorithm 1
figure c

\(BMLP_{active}\)

The input background knowledge is created from a GEM, which is typically accessible from GEM repositories (King et al., 2016; Li et al., 2023) and libraries (Schellenberger & Que, 2011; Ebrahim & Lerman, 2013). The user provides a set of candidate experiments, hypothesis candidates and an experiment cost function. Labels of experiment instances are requested from a data source, e.g. a laboratory or an online dataset. Labelled experimental data are considered ground truth. Initially, we randomly shuffle all candidate experiments and select an experiment with the most discriminative power if its cost is under the budget. Alternatively, a discriminative experiment with the lowest cost can be selected for a fixed initialisation.

Active learning in \(BMLP_{active}\) is described by Algorithm 1. A classical method to address errors in metabolic pathways involves auxotrophic growth experiments, where specific genes are deleted to render the organism incapable of synthesising essential compounds (Beadle & Tatum, 1941). We obtain binary phenotypic classifications of cell growth relative to an unedited wild-type strain. We identified all essential metabolites in the wild-type strain from the BMLP-IE output. A positive label (1) is a phenotypic effect, and a negative label (0) is no phenotypic effect. We predict labels from the combinations of candidate experiments T and hypotheses H and store them in a Boolean matrix \({\textbf {R}}\) of size \(|H| \times |T|\). Given the j-th actively selected instance \(t_j \in T\) and \(h_i \in V_{j} \subseteq H\), \(({\textbf {R}})_{i,j} = 1\) if \(h_i \cup BK \models t_j\) and otherwise \(({\textbf {R}})_{i,j} = 0\).

In each active learning cycle of \(BMLP_{active}\), hypotheses inconsistent with ground-truth experimental outcomes are pruned. \(BMLP_{active}\) selects experiments to minimise the expected value of a user-defined cost function. Precomputing the prediction matrix \({\textbf {R}}\) avoids repetitive predictions during hypothesis pruning and compression calculation. When computing hypothesis compression, we estimate the generality of a hypothesis by calculating its coverage of all unlabelled instances based on \({\textbf {R}}\). Currently, we only consider labels from synthetic or online phenotype data. However, \(BMLP_{active}\) can be coupled with a high-throughput experimental workflow to automate experiments.
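A simplified view of this cycle, ignoring the cost function of Sect. 4.2 and using a tiny hypothetical prediction matrix \({\textbf {R}}\):

```python
# Version-space active learning over a precomputed prediction matrix R:
# repeatedly pick the experiment whose label split is closest to half of
# the surviving hypotheses, then prune the inconsistent ones.
def active_learn(R, oracle, n_rounds):
    hyps = set(range(len(R)))        # row i = hypothesis h_i
    exps = set(range(len(R[0])))     # col j = experiment t_j
    for _ in range(n_rounds):
        if len(hyps) <= 1 or not exps:
            break
        # most discriminative experiment: split closest to |hyps| / 2
        t = min(sorted(exps),
                key=lambda j: abs(sum(R[i][j] for i in hyps) - len(hyps) / 2))
        exps.remove(t)
        label = oracle(t)            # ground-truth label from the data source
        hyps = {i for i in hyps if R[i][t] == label}
    return hyps

R = [[1, 1, 0],                      # predictions of h_0 for t_0..t_2
     [1, 0, 0],                      # predictions of h_1
     [0, 0, 1]]                      # predictions of h_2
truth = [1, 1, 0]                    # h_0 is the target hypothesis
assert active_learn(R, lambda t: truth[t], 3) == {0}
```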

6 Experiments

6.1 Experiment 1: BMLP-IE runtime and predictions

Table 1 Mean runtime of 100 predictions in CPU time

Here we demonstrate that a SWI-Prolog application can be made significantly faster by incorporating Boolean matrix computation. We randomly sampled batches of 100 experiments and computed the average wall time from 10 repeats (Appendix G.2). Table 1 shows a 170-times improvement in prediction time by BMLP-IE compared to using SWI-Prolog alone. BMLP-IE uses only \(\frac{1}{600}\) of the prediction time of pure SWI-Prolog predictions. This result demonstrates that BMLP-IE significantly improves the speed of predicting phenotypes from a genome-scale model.

Fig. 3

Normalised confusion matrices. The predictive accuracy of BMLP-IE is \(88.7\%\) and standard FBA is \(93.4\%\). Each number has been normalised by the size of the experiment dataset in Monk et al. (2017)

FBA (Palsson, 2015) is widely used in the literature to quantitatively predict the cell growth rate based on fluxes of biomass metabolites at the steady state. We evaluated BMLP-IE’s phenotypic predictions against FBA based on the GEM iML1515 (Monk et al., 2017). We created normalised confusion matrices (Fig. 3) for BMLP-IE and the FBA predictions with respect to experimental data (Monk et al., 2017). The confusion matrices show a comparable predictive accuracy by BMLP-IE to that of standard FBA. Compared with FBA’s growth rate predictions, BMLP-IE mainly misclassified reduced cell growth. BMLP-IE is easily adaptable when the GEM is extended, whereas FBA requires extensive manual parameter updates to tune metabolic fluxes.

6.2 Experiment 2: active learning sample complexity and experiment cost

Fig. 4

Total experimentation cost needed for the recovered models to reach different levels of predictive accuracy. The cost definitions and details are in Appendix G and H. The predictive accuracy of an empty hypothesis is 17.4%

We examined how selecting experiments intelligently improves data and cost efficiency in contrast to selecting experiments randomly. The number of experiments and data required are major determinants of experimental cost. We obtained synthetic data from iML1515 (Appendix G.3), and then masked function annotations in iML1515 to mimic actual discoveries. We applied \(BMLP_{active}\) and random experiment selection with compression to decide the final hypothesis for recovering iML1515. We tested the predictive accuracy of the recovered function hypotheses against synthetic data. Each selection method was repeated 10 times.

Figure 4 shows that \(BMLP_{active}\) spends 10% of the cost used by random experiment sampling when converging to the correct hypothesis. The y-axis is the total experimental reagent cost in \(log_{10}\), calculated from selected experiments and normalised by the cost of the cheapest optional nutrient. A predictive accuracy of 100% indicates successful recovery of the deleted gene function annotations. The correct annotations were recovered by \(BMLP_{active}\) and random experiment selection with \(10^{1.99}\) and \(10^{2.99}\) experimental reagent costs, respectively. Given the current hypothesis space, the experimental design space and our assumption that one gram of each reagent was used, learning a single annotation cost \(\pounds 3.8\) with \(BMLP_{active}\) and \(\pounds 38\) with random sampling. Consideration of differential resource factors in the user-defined experimental cost function could significantly increase the total experimental cost.

In addition, Fig. 5 shows that \(BMLP_{active}\) reduces the amount of labelled data needed to learn accurate gene function annotations compared to random experiment selection. While random experiment selection requires 25 experiments to recover the deleted annotations, \(BMLP_{active}\) only needed 3 experiments. Figs. 4 and 5 show that \(BMLP_{active}\) can simultaneously optimise a user-defined experimental cost function and reduce the total number of experiments needed to learn gene function annotations accurately. \(BMLP_{active}\) is highly flexible and could be tailored to specific experimental objectives in a discovery process.

Fig. 5

Reduction of the number of experiments needed to recover gene function annotations (sample complexity reduction). \(BMLP_{active}\) learns fully accurate gene function annotations with 3 experiments, while random sampling requires 25 experiments to do so. The predictive accuracy of an empty hypothesis is 17.4%. Predictive accuracy is higher for both methods initially since experiment selection is no longer constrained by reagent costs

Fig. 6

Ratio of sample complexity. \(BMLP_{active}\) selects instances with a minimal reduction ratio of at least 0.5%. For error \(\epsilon > 0\), the ratio of \(BMLP_{active}\) is clearly above the ratio of random sampling. This shows that Theorem 1 correctly describes the sample complexity relationship between active learning and random sampling

Figure 6 shows the relationships between predictive error and the number of experiments in Theorem 1. \(BMLP_{active}\) selected experiments with minimal reduction ratio \(\phi \ge 0.5\%\). The ratio \(\frac{\epsilon }{s_{active}}\) is greater than \(\frac{\epsilon + \phi }{s_{passive}}\) for error \(\epsilon > 0\). There are fewer points on the active learning ratio curve since \(BMLP_{active}\) converges to accuracy 100% using fewer training examples. The two curves intersect at \(\epsilon = 0\) since both converged to 100% accuracy. Theorem 1 shows that active learning (\(BMLP_{active}\)) can reduce the requirement for training examples compared with passive learning (random experiment selection) to achieve a target learning performance. In the context of scientific discovery, the experimental cost to arrive at a finding can be reduced since each actively selected experiment is more informative.

6.3 Experiment 3: learning digenic functions

Fig. 7

tyrB isoenzyme function recovery frequency. We obtain the frequency of successful recovery from various randomisation seeds and compute the standard errors. We observe that within 64 experiments, random experiment selection does not recover the tyrB isoenzyme function. In contrast, \(BMLP_{active}\) can recover this function with high frequency with 20 experiments

We show the recovery of a known gene-isoenzyme association, which presents a bigger empirical challenge due to the need to explore a larger experimental design space. We focused on the aromatic amino acid biosynthesis pathway, a critical pathway for microorganism survival that contains important gene targets for engineering aromatic amino acid production strains. Although this pathway has been intensively studied for decades, gaps remain in the understanding of gene-gene relations (Price et al., 2018). Figure 7 shows the result of recovering tyrB’s isoenzyme function. We observed that \(BMLP_{active}\) successfully recovers the correct tyrB isoenzyme function in 20 experiments. This demonstrates \(BMLP_{active}\)’s ability to discover gene-isoenzyme associations in a GEM. However, learning all isoenzyme functions requires a high-throughput experimentation procedure, which we discuss further in Sect. 8.

\(BMLP_{active}\) significantly reduced the number of experiments needed compared to random experiment selection. \(BMLP_{active}\) provides higher information gain from each experiment and can guarantee recovery with as few as 20 experiments. While random sampling could prune most candidate hypotheses, it failed to eliminate the remaining competitive hypotheses. In contrast to \(BMLP_{active}\), random sampling could not recover this isoenzyme function within 64 experiments and could only do so occasionally with more than 10 times the number of experiments required by \(BMLP_{active}\). Although we used a reduced set of genes in this demonstration, the combinatorial space was sufficiently large that random experiment selection became non-viable as an experimentation strategy in a discovery process. The remarkable increase in efficiency of \(BMLP_{active}\) on this task demonstrates its potential even as the experimental design space grows.

7 Related work

7.1 Active learning

A typical approach to active learning is via membership queries, where the machine learner queries a data source to label instances according to some strategy. An active learner can be implemented to explore the version space, which contains all hypotheses that are expressible by a hypothesis language and are consistent with a set of training examples (Mitchell, 1982). Instances are chosen according to a measure of informativeness such as entropy (Shannon & Weaver, 1963), diameter (Tosh & Dasgupta, 2017), the size (Mitchell, 1982; Dasgupta, 2005) and the shrinkage (Hocquette & Muggleton, 2018) of the version space for eliminating competing hypotheses. For a binary classification task, when learners are allowed to look at n unlabelled instances to compare, an active learner has an n-times higher probability of selecting an instance with maximal entropy than a passive learner (Hocquette & Muggleton, 2018). The general principle for a binary active learner is to label instances closest to being consistent with half of the hypotheses in the version space (Mitchell, 1982). These instances reject up to half of the hypothesis space regardless of their classifications. In the optimal case when all instances are maximally discriminative, the target hypothesis can be found via a binary search with depth logarithmic in the hypothesis space size (Angluin, 2001). \(BMLP_{active}\) makes membership queries to simultaneously reduce the number of labels and the expected cost of experimentation.

7.2 Computational scientific discovery

Computational discovery systems (Langley et al., 1987; Todorovski & Džeroski, 2006; Brunton et al., 2016; Guimerà et al., 2020; Petersen et al., 2020) are mostly responsible for formulating symbolic hypotheses from experimental results, which are not directly applicable for experimental planning. The automation of experimental design was investigated by the pioneering Robot Scientist (King et al., 2004) to bring together logical reasoning, active learning and laboratory automation. The Robot Scientist automatically proposes explanatory hypotheses, actively devises validation experiments, performs experiments using laboratory robotics, and interprets empirical observations. It was combined with a laboratory robot to learn gene function annotations for aromatic amino acid pathways, and the active learning strategy significantly outperformed random experiment selection (King et al., 2004). This was extended to identify genes encoding orphan enzymes from a logical model of yeast consisting of 1200 genes (King et al., 2009). In comparison, we apply \(BMLP_{active}\) to a genome-scale metabolic network, which is significantly larger than the models previously examined. In addition, we examined isoenzymes, which were previously unexplored by logic programming due to larger hypothesis and experimental design spaces.

Recently, large language models (LLMs) have been explored to automate experimental design via user-provided prompt texts (Boiko et al., 2023; Bran et al., 2023). LLMs are generative models of natural language based on neural networks trained on textual data. In Boiko et al. (2023); Bran et al. (2023), LLMs served as experiment planners and coding assistants. These platforms use LLMs to access publicly available knowledge bases, write and interpret programs to control laboratory hardware, and analyse collected experimental data. The advantage is accessing rich information online and interacting with users while designing experiments. However, the faithfulness of their outputs raises concerns. It has been stressed that autonomous AI experimentation should be treated with caution when there are no means to identify the source of error and explain observations (Ren et al., 2023). \(BMLP_{active}\)’s logical representation and computation enhance faithfulness and output verifiability, and do not suffer from the ambiguity of natural language.

7.3 Learning genome-scale metabolic network models

GEMs contain known metabolic reactions in organisms, serving as functional knowledge bases for their metabolic processes. There has been increasing interest in integrating machine learning methods with GEMs to improve the predictions of gene-phenotype correlations within biological systems (Angione, 2019; Sen et al., 2021). This often requires constraint-based modelling techniques for simulating GEMs (Rana et al., 2020). Flux balance analysis (FBA) (Palsson, 2015) is a widely used mathematical tool to solve flux distributions in genome-scale models under the steady-state condition. FBA uses linear programming to model fluxes in a chemical reaction network represented by a matrix that contains the stoichiometric coefficients of the metabolites in reactions. Ensemble learning frameworks such as those of Wu et al. (2016); Oyetunde et al. (2019); Heckmann et al. (2018) trained support vector machines, decision trees and artificial neural networks to represent hidden constraints between genetic factors and metabolic fluxes. These systems cannot identify inconsistencies between the simulation results and the experimental data to self-initialise the constraints to aid learning (Rana et al., 2020). On the other hand, mechanistic information that reflects whole-cell dynamics has been incorporated to address these concerns, for instance in artificial neural networks to tune their parameters (Yang et al., 2019). The hybrid approach in Faure et al. (2023) embedded FBA within artificial neural networks based on custom loss functions surrogating the FBA constraints. Recent research also explored autoencoders to learn the underlying relationship between gene expression data and metabolic fluxes through fluxes estimated by FBA (Sahu et al., 2021). Machine learning of genome-scale metabolic networks has been significantly limited by the availability of training data (Sen et al., 2021). \(BMLP_{active}\) does not use FBA, so it is not dependent on metabolic flux constraints.
\(BMLP_{active}\) also performs active learning to reduce the cost and number of experiments, a notable advantage given finite experimental resources and the scarcity of experimental data.
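To make the FBA formulation above concrete, the following sketch solves a steady-state flux distribution with SciPy's `linprog`. The three-reaction network (uptake, conversion, export) and its bounds are hypothetical toy values, not part of iML1515 or any published model:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network with two internal metabolites and three reactions:
# v1: uptake -> A, v2: A -> B, v3: B -> export
S = np.array([[1.0, -1.0, 0.0],    # row for metabolite A
              [0.0, 1.0, -1.0]])   # row for metabolite B

# Maximise flux through the export reaction v3 (linprog minimises, so negate)
c = np.array([0.0, 0.0, -1.0])

# Flux bounds: uptake capped at 10 units, other reactions effectively unbounded
bounds = [(0, 10), (0, 1000), (0, 1000)]

# Steady-state constraint S v = 0 forces v1 = v2 = v3
res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds)
print(res.x)  # optimal flux distribution: all fluxes equal the uptake cap, 10
```

Since the equality constraint chains the three fluxes together, the optimum saturates the uptake bound, illustrating how FBA reduces to linear programming over the stoichiometric matrix.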

7.4 Matrix algebraic approaches to logic programming

Obtaining the least Herbrand model of a recursive datalog program can be reduced to computing the transitive closure of Boolean matrices (Peirce, 1932; Copilowish, 1948). Fischer et al. (1971) studied a divide-and-conquer Boolean matrix computation technique by viewing relational databases as graphs. A similar approach was explored by Ioannidis (1986) for computing the fixpoint of recursive Horn clauses. The ILP system DeepLog (Muggleton, 2023) employed Boolean matrices for learning recursive datalog programs: it repeatedly squares Boolean matrices to quickly select optimal background knowledge for deriving target hypotheses. A propositional Boolean-matrix representation of biological knowledge bases was explored by Sato and Kojima (2021) and Sato and Inoue (2023), who treated gene regulatory networks as Boolean networks. In our BMLP framework, we use Boolean matrices to evaluate first-order recursive datalog programs.
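The reduction from least-model computation to transitive closure can be sketched with repeated Boolean matrix squaring; the `edge`/`path` program and the four-node chain below are illustrative examples, not taken from the paper's knowledge base:

```python
import numpy as np

def transitive_closure(A):
    """Transitive closure of a Boolean adjacency matrix by repeated squaring.

    Computing R+ = R | R^2 | R^4 | ... corresponds to bottom-up evaluation
    of the linear recursive datalog program
        path(X,Y) :- edge(X,Y).
        path(X,Z) :- edge(X,Y), path(Y,Z).
    and converges in at most log2(n) squaring rounds.
    """
    R = A.astype(bool)
    while True:
        # Boolean matrix product: OR of ANDs, via integer matmul + threshold
        R_next = R | ((R.astype(int) @ R.astype(int)) > 0)
        if np.array_equal(R_next, R):
            return R
        R = R_next

# edge/2 facts for the chain 0 -> 1 -> 2 -> 3
edges = np.zeros((4, 4), dtype=bool)
edges[0, 1] = edges[1, 2] = edges[2, 3] = True
print(transitive_closure(edges).astype(int))  # path/2 holds for all 6 chain pairs
```

Each entry of the closure corresponds to a derivable `path/2` atom, so the fixpoint of the Boolean computation is exactly the least Herbrand model restricted to that predicate.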

Recent work has primarily studied mappings of logic programs to linear equations over tensor spaces. SAT problems have been investigated under a linear algebra framework (Lin, 2013), where the truth values of domain entities, logical relations and operators of first-order logic are evaluated using a tensor-based calculus (Grefenstette, 2013). Computing algebraic solutions can approximate recursive relations in first-order datalog (Sato, 2017). Building on this approach, abduction can be performed by encoding a subset of recursive datalog programs in tensor space (Sato et al., 2018). Tensor representations also allow differentiable infrastructures such as neural networks to perform probabilistic logical inference (Cohen et al., 2020), using dynamic programming to propagate beliefs for stochastic first-order logic programs; the probabilistic parameters can then be optimised using deep learning frameworks (Yang et al., 2017). However, limitations have been identified for certain recursive programs whose arithmetic solutions are difficult to compute, although these programs can be solved by iterative bottom-up evaluation (Sato, 2017). In addition, linear algebraic methods usually compute algebraic approximations, so the correctness of their solutions cannot be guaranteed. In contrast, our BMLP approach does not approximate Boolean matrices.
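A minimal sketch of the matrix-evaluation idea for the propositional case: iterating the immediate-consequence operator \(T_P\) as a matrix-vector product with thresholding. The three-atom program and its index assignment are hypothetical, chosen only to illustrate the encoding:

```python
import numpy as np

# Hypothetical propositional program with single-literal bodies:
#   p :- q.   q :- r.   r.
# Atoms indexed as p=0, q=1, r=2; M[i, j] = 1 iff the rule "atom_i :- atom_j" exists.
M = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])
facts = np.array([0, 0, 1])  # r is a fact

# Iterate the immediate-consequence operator T_P to its least fixpoint:
# an atom becomes true if it is a fact or some rule body for it is true.
v = facts.copy()
while True:
    v_next = ((M @ v + facts) > 0).astype(int)
    if np.array_equal(v_next, v):
        break
    v = v_next
print(v)  # least model {p, q, r} encoded as [1 1 1]
```

The thresholded product plays the role of logical disjunction over rule bodies; exact Boolean evaluation of this kind, rather than real-valued approximation, is what BMLP performs at the first-order level.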

8 Conclusion and future work

To our knowledge, \(BMLP_{active}\) is the first logic programming system able to operate on a state-of-the-art GEM, thanks to the improved computational efficiency of Boolean matrices. \(BMLP_{active}\) learned accurate gene function annotations while reducing experimental costs by 90%. In addition, we demonstrated that \(BMLP_{active}\) actively learned gene-isoenzyme functions with 90% fewer training data points than random experiment selection. This suggests that \(BMLP_{active}\) could offset the growth in experimental design space arising from genome-scale function learning. Future work could explore the use of Boolean matrix encodings of logic programs for even larger networks with GPU acceleration.

One limitation of \(BMLP_{active}\) is that it cannot quantitatively predict the behaviour of biological systems. Future work could focus on using Probabilistic Logic Programming (De Raedt & Kimmig, 2015) and tensor-based algebras of logic programs (Sakama et al., 2021) given numeric training data. \(BMLP_{active}\) also does not invent new predicates. Reusing invented predicates can reduce the size of a hypothesis learned by an ILP system and improve its learning performance (Cropper et al., 2021); recent methods (Muggleton, 2023; Evans & Grefenstette, 2018; Dai & Muggleton, 2021) have enabled predicate invention in the context of matrix-driven ILP. Finally, while we touched on the theoretical properties of active learning, comparisons with other active learning approaches and further theoretical analysis would better contextualise our performance.

Importantly, we show that our approach can address interactions between multiple genes, which are not addressable by current mutagenesis approaches. We aim to integrate \(BMLP_{active}\) into a high-throughput experimental workflow (Wang et al., 2018) for whole-genome function learning. Experimental data can first be compared with phenotypic predictions, narrowing the scope of the hypothesis space. With advanced gene-editing tools, specific gene edits can then be targeted to experimentally confirm candidate hypotheses during active learning.

Designing experiments to explain a hypothesised phenomenon requires deliberate thinking by scientists. A helpful scientific assistant could teach scientists how to create experiment plans with optimal outcomes. This aspect of human-AI collaboration has been studied in the emerging research area of Ultra-Strong Machine Learning (USML) (Muggleton et al., 2018; Ai et al., 2021, 2023; Krenn et al., 2022). The present work can be extended to examine the effect of AI decision-making on human experiment selection and on human comprehension of complex biological concepts.