MASS-CSP: mining with answer set solving for contrast sequential pattern mining

Sterlicchio, Gioacchino; Lisi, Francesca Alessandra

doi:10.1007/s10994-025-06876-0

Open access
Published: 28 September 2025

Volume 114, article number 235, (2025)

Machine Learning


Abstract

In this paper, we present MASS-CSP (Mining with Answer Set Solving - Contrast Sequential Patterns), a declarative approach to the Contrast Sequential Pattern Mining (CSPM) task, which is based on the logic-based framework of Answer Set Programming (ASP). The CSPM task focuses on identifying significant differences in frequent sequences relative to specific classes, leading to the concept of a contrast sequential pattern. The article describes how MASS-CSP addresses the CSPM task and its related extensions, namely mining closed, maximal and constrained patterns. The evaluation compares the basic version of MASS-CSP against the extended versions with regard to output size and time and memory requirements.

1 Introduction

Recently, there has been a growing abundance of data consisting of sequences of events, items, or tokens structured within an ordered metric space. Consequently, the necessity to identify and examine frequent subsequences has emerged as a prevalent issue. To tackle this challenge, Sequential Pattern Mining (SPM) evolved as a subdivision of pattern mining (refer to Mooney and Roddick (2013) for an overview). Typically, the main task of SPM involves discovering frequent, non-empty temporal sequences, known as sequential patterns, within a sequence dataset. Another intriguing form of pattern mining is called Contrast Pattern Mining (CPM) (Dong & Bailey, 2013), where the primary task is to identify statistically significant differences or similarities, termed contrast patterns, across two or more distinct datasets, or parts of the same dataset. Both problems are recognized for their greater complexity compared to, for example, itemset mining. Nonetheless, SPM and CPM are widely applicable in various domains including the analysis of patient care pathways, consumer purchasing behaviours (for recommendation rules), educational data trails, digital logs (such as web access for customer profiling or network logs for intrusion detection) and bioinformatics sequences (Guyet et al., 2017; Lisi & Sterlicchio, 2022; Sterlicchio & Lisi, 2024; Zheng et al., 2016; Guyet, 2020). In this paper, we combine the notions of sequential and contrast patterns to identify significant differences in frequent sequences relative to specific classes, leading to the concept of a contrast sequential pattern. Although Contrast Sequential Pattern Mining (CSPM) is not a novel task, it has received limited attention to date (refer to Chen et al. (2022) for a recent review).

Recent advances in the area of so-called Declarative Pattern Mining (DPM) have revitalized the exploration of declarative approaches to several pattern mining tasks ranging from itemset mining (Guns et al., 2017; Järvisalo, 2011) to sequence mining (Négrevergne & Guns, 2015; Coquery et al., 2012; Métivier et al., 2013). The practical application of DPM approaches hinges on the efficiency of their encoding when dealing with real-world datasets. Thanks to advances in satisfiability (SAT) and constraint programming (CP) solving techniques and solvers, these methods have emerged as viable alternatives for highly constrained mining tasks, with their computational efficiency now nearly matching that of specialized algorithms. The long-term objective is to benefit from the genericity of solvers to let a user specify a potentially infinite range of constraints on the patterns. Thus, we expect to go from specific algorithm constraints to a rich query language for pattern mining.

To the best of our knowledge, there is no DPM approach that supports the CSPM task. In this paper, we address the CSPM task and propose an approach, called MASS-CSP (Mining with Answer Set Solving - Contrast Sequential Patterns), that is based on Answer Set Programming (ASP). ASP is a programming language for declarative problem solving (Lifschitz, 2016; Brewka et al., 2011). Its first-order logic syntax makes ASP programs easy to understand. Furthermore, ASP benefits from efficient solvers that compute the answer sets corresponding to solutions of difficult problems. In MASS-CSP we have developed a concise and versatile ASP encoding for the CSPM problem and for managing complex pattern preferences. More precisely, this paper sums up our efforts on the development of MASS-CSP, encompassing a basic formulation of the CSPM task (Lisi & Sterlicchio, 2023), a variant with condensed representations (Sterlicchio & Lisi, 2024) and another variant with span and gap constraints (Sterlicchio & Lisi, 2024). The focus of the present work is on the evaluative aspects of MASS-CSP, in particular on the efficiency of the grounding and solving phases.

Summing up, the contributions of this article are the following:

1. We present MASS-CSP, a declarative approach that provides a high-level specification of the CSPM task. It demonstrates that this task and its related variants (mining closed, maximal and gap- and span-constrained patterns) can be easily encoded with ASP.
2. We evaluate MASS-CSP along two dimensions. The first is the pattern mining task, i.e. the ability to extract patterns and the time required to do so. The second refers to ASP, i.e. the time and memory requirements of the grounding and solving phases.

The article is organized as follows. Section 2 overviews the current research in DPM. Section 3 introduces the ASP paradigm, also with the help of a small example. In Section 4, we recall the main concepts of sequential pattern mining and contrast pattern mining, and clarify how they can be merged in CSPM. In Section 5, we explain the rationale behind our ASP encoding of the CSPM problem (basic and extended versions) and illustrate the technical details of the implementations. Experimental results with MASS-CSP are discussed in Section 6, whereas in Section 7 we conclude the article with final remarks.

2 Related works

Sequential and Contrast Pattern Mining are challenging tasks in data mining and play an important role in many applications. Notably, PrefixSpan is an optimized algorithm for mining sequences (Han et al., 2001). The notion of contrast is deeply discussed in Dong and Bailey (2013). Chen et al. (2022) provide an up-to-date comprehensive and structured overview of the research in Contrast Pattern Mining which includes also the case of contrast sequential patterns. In particular, Zheng et al. (2016) present a contrast sequential pattern mining (CSPM) method for taxpayer behaviour analysis, and Wu et al. (2022) propose a top-k self-adaptive CSPM solution.

Declarative pattern mining (DPM) covers many pattern mining tasks such as sequence mining (Négrevergne & Guns, 2015; Gebser et al., 2016) and frequent itemset mining (Jabbour et al., 2015; Guns et al., 2017). In Négrevergne and Guns (2015), the authors organize the constraints on sequential patterns in three categories: 1) constraints on patterns, 2) constraints on pattern embeddings, 3) constraints on pattern sets. These constraints are provided by the user and capture the user's background knowledge. Then, they introduce two formulations based on Constraint Programming (CP). Jabbour et al. (2015) propose SAT-based encodings of itemset mining problems to overcome space complexity issues behind the competitiveness of new declarative and flexible models. In Guns et al. (2017), MiningZinc is presented as a declarative framework for constraint-based data mining.

Besides SAT and CP, Answer Set Programming (ASP) is also widely used in DPM. The first proposal is described by Guyet et al. (2014). The authors explore a first attempt to solve the sequential pattern mining (SPM) problem with ASP and compare their method with a dedicated algorithm. Next, Gebser et al. (2016) use ASP for extracting condensed representations of sequential patterns. They focus on closed, maximal and skyline patterns. Samet et al. (2017) present a method for mining meaningful rare sequential patterns with ASP, whereas Guyet et al. (2017) propose to apply an ASP-based DPM approach to investigate the possible association between hospitalization for seizure and the switch to antiepileptic drugs from a French medical database. Guyet et al. (2016) present the use of ASP to mine sequential patterns within two representations of embeddings (fill-gaps vs skip-gaps) and various kinds of patterns: frequent, constrained and condensed. Besnard and Guyet (2020) address the task of mining negative sequential patterns in ASP. A negative sequential pattern is specified by means of a sequence consisting of events to occur and of other events, called negative events, to be absent. In Lisi and Sterlicchio (2022a, 2022b) Guyet’s ASP encodings for SPM are adapted in order to address the requirements of an application in the digital forensics domain: the analysis of anonymised mobile phone recordings. Motivated by the same application, Lisi and Sterlicchio present an ASP-based approach to contrast pattern mining in Lisi and Sterlicchio (2022b). The effectiveness of the declarative approach to CSPM presented in Lisi and Sterlicchio (2023) and Sterlicchio and Lisi (2024) has been shown in Sterlicchio and Lisi (2024) for finding attack patterns in communication networks. Whereas all the works mentioned so far are pure ASP-based DPM solutions, particularly interesting is the hybrid ASP-based approach proposed by Paramonov et al. (2019), which combines dedicated algorithms for pattern mining and ASP.

3 Answer set programming

In this section we introduce the Answer Set Programming (ASP) paradigm, syntax and tools. ASP is a declarative programming language. The reader can find a more extensive introduction to ASP in Brewka et al. (2011). It is used to solve hard computational problems (e.g. security analysis, planning, scheduling, configuration, semantic web, etc.). From a general point of view, declarative programming describes what the problem is instead of specifying how to solve it. Various declarative programming approaches exist, each employing a distinct modelling formalism. For example, logic programming (like ASP and Prolog) relies on logical rules, SAT solvers use Boolean expressions, and constraint programming (CP) utilizes constraints to define problems. Logic-based formalisms tend to offer more readable programs compared to other declarative approaches due to their high-level syntax.

Every ASP program is made up of atoms, literals and logic rules. Atoms can be true or false, and a literal is an atom a or its negation \(not \, a\). A general ASP rule has the form \(a_0 :-\, b_1, \ldots , b_k,\, not \, b_{k+1}, \ldots , not \, b_m\), where \(a_0\) and all \(b_j\) are atoms and \(not\) stands for default negation. In the body of a rule, commas denote conjunction. The above rule may be read as “if \(b_1,\,\dots ,\,b_k\) are all true and none of \(b_{k+1},\,\dots ,\,b_m\) can be proved to be true, then \(a_0\) is true”. If \(m=0\), i.e. the rule body is empty, the rule is called a fact and the symbol \(:-\) may be omitted. Such a rule states that the atom \(a_0\) has to be true. If \(a_0\) is omitted, i.e. the rule head is empty, the rule represents an integrity constraint or denial. Adding a denial to an ASP program eliminates the answer sets that satisfy the denial body. The ASP syntax also includes some extensions; in the following, we mention only those used in the MASS-CSP encoding. Rules can contain symbols for arithmetic and comparison operations, as in \(r(X+Y) :- \,p(X),\,q(Y),\,X<Y\). A choice rule has a list of atoms between braces in its head, which acts as a “choice” constructor: any subset of the listed atoms may be included in an answer set. An example of a choice rule is \(\{p(1); \,p(2)\}\). An aggregate is a function on a set of tuples that are normally subject to conditions. By comparing an aggregate value with given values, one can extract truth values from the evaluation of an aggregate, thus obtaining an aggregate atom. The form of an aggregate is as follows: \(s_1 \prec _1 a \{t_1:L_1; \dots ; t_n:L_n\} \prec _2 s_2\). All the \(t_i\) and \(L_i\) that form the aggregate elements are tuples of terms and literals respectively; a is the name of some function that is applied to the tuples of terms \(t_i\) that remain after the evaluation of the conditions expressed by \(L_i\). Finally, the result of a is compared with the terms \(s_1\) and \(s_2\) through the comparison predicates \(\prec _1\) and \(\prec _2\), respectively. Either of the two comparison predicates, or both, can be omitted.

Semantically, an ASP logic program induces a collection of so-called answer sets, which are distinguished models of the program determined by answer set semantics; see Gelfond and Lifschitz (1991) for details. In short, a model assigns a truth value to each atom of the program in such a way that all the program rules are satisfied. An answer set is a minimal set of true atoms that satisfies all the program rules. Answer sets are minimal in the sense that only atoms that have to be true are actually true.

ASP solvers (e.g. clingo (Gebser et al., 2014) and DLV (Leone et al., 2019)) use efficient algorithms based on non-monotonic reasoning and logic programming techniques to compute answer sets for given programs. These systems typically employ grounding techniques to convert first-order programs into propositional form, so that the resulting variable-free programs can be solved as efficiently as current solving techniques allow.

Example 3.1

(Graph colouring problem) The example shown in Listing 1 illustrates the ASP syntax by encoding the graph colouring problem (see Figure 1 for the graphical representation, ignoring the colours of the nodes). The problem asks for a colouring of the vertices of a graph such that no two adjacent vertices have the same colour.

Lines 1-7 specify the problem instance, or facts, encoded with predicates |node/1| and |edge/2|, where |p/n| denotes a predicate |p| with n arguments. The input graph has 6 nodes numbered from 1 to 6 and 17 edges. Considering the denial at line 12, edges are symmetric. Line 9 defines the constants given as input to the problem; in this case we have three different colours.

Strings beginning with upper case letters represent variables (example in Line 11). Lines 11 and 12 specify the graph colouring problem. The predicate |colour/2| encodes the colour of a node: |colour(X,1..n)| expresses that node |X| has a colour from |1| to |n|. Line 13 forbids neighbouring vertices |X| and |Y| from having the same colour |C|. Line 11 is a choice rule indicating that for a given node |X|, an answer set must contain exactly one atom of the form |colour(X,C)|, where |C| is a colour. The grounded version of this rule is the following:

Assume the colour numbering |1 = red|, |2 = yellow| and |3 = blue|. After running an ASP solver, the possible solutions of the problem (the answer sets, six in total) are shown in the following Listing and graphically represented in Figure 1.
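The generate-and-test reading of a choice rule combined with a denial can be mimicked in plain Python. The sketch below is only an illustration of that reading, not the ASP listing of the example: the graph is a hypothetical 4-cycle (not the graph of Figure 1), and the names `nodes`, `edges`, `n_colours` and `colourings` are assumptions introduced here.

```python
from itertools import product

# Hypothetical instance (NOT the graph of Figure 1): a 4-cycle and 3 colours.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
n_colours = 3


def colourings(nodes, edges, n_colours):
    """Enumerate proper colourings by generate and test.

    Generating one colour per node plays the role of the choice rule;
    rejecting assignments where an edge joins two equal colours plays
    the role of the denial.
    """
    for combo in product(range(1, n_colours + 1), repeat=len(nodes)):
        colour = dict(zip(nodes, combo))
        if all(colour[x] != colour[y] for x, y in edges):
            yield colour


solutions = list(colourings(nodes, edges, n_colours))
print(len(solutions), "proper colourings")
```

An ASP solver does not enumerate candidates this naively, but each surviving assignment corresponds to what an answer set would describe for this toy instance.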

4 Contrast sequential pattern mining

Contrast sequential pattern mining (CSPM) focuses on identifying patterns in sequential data that differentiate between distinct classes. It is derived from sequential pattern mining (SPM), that discovers frequent patterns in sequences, and contrast pattern mining (CPM), that finds patterns highlighting differences between data. By combining these approaches, contrast sequential pattern mining aims to uncover significant patterns that can explain the differences in behaviours or characteristics across multiple labelled sequences. In the following, we introduce the main concepts of CSPM in Section 4.1. Then, we illustrate how the CSPM can be modified to solve more complex mining tasks. In particular, we focus our attention on condensed representations in Section 4.2 and constrained contrast sequential patterns in Section 4.3.

4.1 The task of CSPM

In the following, we first briefly formalize the SPM problem, which aims at identifying frequent subsequences within a sequence database \(\mathcal {D}\). Throughout this article, \([n] = \{1, \ldots , n\}\) denotes the set of the first n positive integers. Let \(\Sigma\) be the alphabet, i.e., the set of items. An itemset \(A = \left\{ a_{1}, a_{2}, \ldots , a_{m} \right\} \subseteq \Sigma\) is a finite set of items. The size of A, denoted |A|, is m. A sequence s is of the form \(s = \left\langle s_{1} s_{2} \ldots s_{n} \right\rangle\) where each \(s_{i}\) is an itemset, and n is the length of the sequence. A database \(\mathcal {D}\) is a multiset of sequences over \(\Sigma\). A sequence \(s = \left\langle s_{1} \ldots s_{m} \right\rangle\) with \(s_{i} \subseteq \Sigma\) is contained in a sequence \(t = \left\langle t_{1} \ldots t_{n}\right\rangle\) with \(m \le n\), written \(s \sqsubseteq t\), if there exists a strictly increasing sequence \((e_{1} \ldots e_{m})\) of positive integers \(e_{i} \in [n]\), called an embedding of s in t, such that \(s_{i} \subseteq t_{e_{i}}\) for \(1 \le i \le m\). Practically speaking, an embedding is a mapping of the pattern’s items to positions in a sequence. For example, we have \(\left\langle a (cd)\right\rangle \sqsubseteq \left\langle a b (cde)\right\rangle\) relative to embedding (1, 3). Here, (cd) denotes the itemset made of items c and d. Given a database \(\mathcal {D}\), the cover of a sequence s is the set of sequences in \(\mathcal {D}\) that contain s: \(cover (s, \mathcal {D}) = \{t \in \mathcal {D} \,|\, s \sqsubseteq t\}\). The number of sequences in \(\mathcal {D}\) containing s is called its support, that is, \(supp(s,\mathcal {D}) = | cover (s, \mathcal {D})|\). For an integer minsup (which is often referred to as the minimum support threshold), the problem of frequent sequence mining is to discover all sequences s such that \(supp(s, \mathcal {D}) \ge minsup\). Each sequence that satisfies this requirement is called a (sequential) pattern.
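The definitions of embedding, cover and support can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than part of MASS-CSP: sequences are modelled as lists of Python sets, the function names `embedding`, `cover` and `supp` are chosen here, and the three-sequence database `db` is hypothetical. Only the pair \(\langle a (cd)\rangle\), \(\langle a b (cde)\rangle\) comes from the text.

```python
def embedding(s, t):
    """Return the leftmost embedding of s in t as 1-based positions, or None.

    s and t are lists of sets (itemsets). Matching each itemset of s at the
    earliest admissible position of t finds an embedding whenever one exists.
    """
    positions, pos = [], 0
    for itemset in s:
        while pos < len(t) and not itemset <= t[pos]:
            pos += 1
        if pos == len(t):
            return None
        positions.append(pos + 1)
        pos += 1
    return tuple(positions)


def cover(s, db):
    """Sequences of the database that contain s."""
    return [t for t in db if embedding(s, t) is not None]


def supp(s, db):
    """Support of s: the size of its cover."""
    return len(cover(s, db))


s = [{"a"}, {"c", "d"}]
t = [{"a"}, {"b"}, {"c", "d", "e"}]
print(embedding(s, t))  # the embedding (1, 3) mentioned in the text

# Hypothetical database: only the first sequence contains s.
db = [t, [{"a"}, {"c"}], [{"c", "d"}, {"a"}]]
print(supp(s, db))
```

With a threshold such as `minsup = 1`, `s` would count as a sequential pattern of this toy database.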

A contrast sequential pattern is defined as a sequential pattern that occurs frequently in one sequence dataset but not in the others (Chen et al., 2022). We start by introducing the concept of growth rate. Given two sequence datasets, \(\mathcal {D}_1\) labelled with class \(C_1\) and \(\mathcal {D}_2\) labelled with class \(C_2\), the growth rate from \(\mathcal {D}_2\) to \(\mathcal {D}_1\) of a sequential pattern s is defined as \(GR_{C_1}(s) = \frac{supp(s,\mathcal {D}_1)/|\mathcal {D}_1|}{supp(s,\mathcal {D}_2)/|\mathcal {D}_2|}\). If \(supp(s,\mathcal {D}_2) = 0\) and \(supp(s, \mathcal {D}_1) \ne 0\) then \(GR_{C_1}(s) = \infty\). In the same way, the growth rate from \(\mathcal {D}_1\) to \(\mathcal {D}_2\) of s is defined as \(GR_{C_2}(s) = \frac{supp(s,\mathcal {D}_2)/|\mathcal {D}_2|}{supp(s,\mathcal {D}_1)/|\mathcal {D}_1|}\). If \(supp(s,\mathcal {D}_1) = 0\) and \(supp(s, \mathcal {D}_2) \ne 0\) then \(GR_{C_2}(s) = \infty\). The contrast rate of s is defined as \(CR(s) = \max \{GR_{C_1}(s),GR_{C_2}(s)\}\), and if \(GR_{C_1}(s) = 0\) and \(GR_{C_2}(s) = 0\) then \(CR(s) = \infty\). A sequence s in a sequence dataset is said to be a contrast sequential pattern if its contrast rate is not lower than the given threshold: \(CR(s) \ge mincr\).

Example 4.1

(CSPM task) Table 1 shows a sequence dataset \(\mathcal {D}\) of account accesses, obtained by merging the datasets \(\mathcal {D}_1\) and \(\mathcal {D}_2\) that contain normal and attack sequences, respectively. We start by finding sequential patterns. Given \(minsup=2\), \(\langle login\_attempt,authorized \rangle\) is a sequential pattern because it occurs in sequences 1, 2 and 4. Another example is \(\langle login\_attempt,auth\_failed,login\_attempt,auth\_failed \rangle\), which occurs in sequences 3 and 4. Assuming we have found all the sequential patterns, we check whether they are contrasting for one of the two classes. Given \(mincr=2\), \(p_1=\langle login\_attempt,authorized \rangle\) and the metrics \(supp(p_1,\,\mathcal {D}_1)=2\), \(supp(p_1,\,\mathcal {D}_2)=1\), \(GR_{ normal }(p_1)=2\), \(GR_{ attack }(p_1)=0.5\), and \(CR(p_1)=2\), \(p_1\) is a contrast sequential pattern for normal because \(CR(p_1) \ge mincr\). Given \(p_2= \langle login\_attempt,auth\_failed,login\_attempt,auth\_failed \rangle\) and its metrics \(supp(p_2,\,\mathcal {D}_1)=0\), \(supp(p_2,\,\mathcal {D}_2)=2\) and \(GR_{ attack }(p_2)=\infty\), \(p_2\) is a contrast sequential pattern only for the attack class. This toy example suggests how contrast patterns can help identify key differences between normal and attack sequences in log analysis, which can be used for anomaly detection.
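The arithmetic of this example can be retraced with a few lines of Python. This is a sketch under assumptions: the function names are introduced here, the dataset sizes \(|\mathcal {D}_1| = |\mathcal {D}_2| = 2\) are assumed (equal sizes are what make the reported growth rates come out as 2 and 0.5), and returning 0 when both supports are zero is a convention chosen for the sketch so that the rule \(CR(s) = \infty\) of Section 4.1 can then be applied.

```python
import math


def growth_rate(supp_target, size_target, supp_other, size_other):
    """Growth rate towards the target class, as defined in Section 4.1."""
    if supp_other == 0:
        # Infinite when the pattern occurs only in the target dataset.
        # Returning 0.0 when both supports are zero is an assumption.
        return math.inf if supp_target != 0 else 0.0
    return (supp_target / size_target) / (supp_other / size_other)


def contrast_rate(gr_c1, gr_c2):
    """Contrast rate: the larger growth rate, infinite if both are zero."""
    if gr_c1 == 0 and gr_c2 == 0:
        return math.inf
    return max(gr_c1, gr_c2)


size_normal = size_attack = 2  # assumed dataset sizes
mincr = 2

# p1 = <login_attempt, authorized>: support 2 in normal, 1 in attack.
gr_normal_p1 = growth_rate(2, size_normal, 1, size_attack)
gr_attack_p1 = growth_rate(1, size_attack, 2, size_normal)
cr_p1 = contrast_rate(gr_normal_p1, gr_attack_p1)

# p2: support 0 in normal, 2 in attack.
gr_normal_p2 = growth_rate(0, size_normal, 2, size_attack)
gr_attack_p2 = growth_rate(2, size_attack, 0, size_normal)

print(gr_normal_p1, gr_attack_p1, cr_p1 >= mincr)
print(gr_normal_p2, gr_attack_p2)
```

Under these assumptions the sketch reproduces the values stated in the example: growth rates 2 and 0.5 for \(p_1\), and an infinite growth rate towards the attack class for \(p_2\).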

Table 1 An example of dataset concerning normal and attack sequences to access account


4.2 Condensed representations

Condensed patterns (Guyet, 2020; Gebser et al., 2016) are subsets of the frequent patterns that can be considered as representative of the whole set of frequent patterns. There are two types of condensed patterns: maximal patterns and closed patterns. A closed pattern is such that none of its frequent super-patterns has the same support. A maximal pattern is such that none of its super-patterns is frequent. The formal definitions given in Gebser et al. (2016) are as follows. A pattern s is maximal if there is no other pattern t such that \(s \sqsubseteq t\) and \(supp( t,\mathcal {D} ) \ge minsup\). A pattern s is closed if there is no other pattern t such that \(s \sqsubseteq t\) and \(supp( s,\mathcal {D} ) = supp( t,\mathcal {D} )\). Mining closed patterns drastically reduces the number of patterns without loss of information for the analyst (Guyet, 2020). Mining maximal patterns is simpler, as only the largest patterns need to be reported. More research has been done on closed patterns, which are more difficult to extract but carry the same information as the complete set of extracted patterns (Guyet, 2020).
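As a small illustration of the two definitions, the following Python sketch filters a given set of frequent patterns. It is not the ASP encoding discussed later: patterns are simplified to tuples of single items, the supports in `freq` are hypothetical, and the names `is_subseq` and `condensed` are introduced here.

```python
def is_subseq(s, t):
    """True if the item sequence s is a subsequence of t."""
    it = iter(t)
    return all(item in it for item in s)


def condensed(freq):
    """Split frequent patterns into closed and maximal ones.

    freq maps each frequent pattern (a tuple of items) to its support.
    """
    closed, maximal = [], []
    for s, supp_s in freq.items():
        # Frequent super-patterns of s, excluding s itself.
        supers = [t for t in freq if t != s and is_subseq(s, t)]
        if not supers:
            maximal.append(s)
        if all(freq[t] != supp_s for t in supers):
            closed.append(s)
    return closed, maximal


# Hypothetical supports, e.g. for the database <a b>, <a b>, <a> with minsup = 2.
freq = {("a",): 3, ("b",): 2, ("a", "b"): 2}
closed, maximal = condensed(freq)
print(closed)
print(maximal)
```

Here \(\langle b \rangle\) is not closed because \(\langle a, b \rangle\) has the same support, while \(\langle a \rangle\) is closed but not maximal; every maximal pattern is also closed.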

4.3 Gap and span constraints

The definitions in Section 4.1 are a starting point for the CSPM task, but they leave two aspects of an embedding unconstrained. For illustrative purposes, let us consider the pattern \(\langle a,\,b \rangle\) and the sequences \(\langle a,\,b,\,c\rangle\) and \(\langle a,\,c,\,c,\,b\rangle\). First, the definitions do not deal with the gap between consecutive positions of an embedding. In other words, two consecutive items of a sequential pattern can be n positions apart within a sequence: in the example, the gaps are 0 and 2 respectively. Secondly, \(\langle a,\,b \rangle\) is supported by both sequences but with different spans, namely 1 and 3 respectively. These two observations may be crucial in different application domains because patterns that reflect certain characteristics are more informative. Pei et al. (2007) defined many types of constraints on patterns and embeddings, among which are those based on the notions of gap and span.

These constraints are useful in several ways: 1) by applying the span and gap constraints, we can reduce the number of candidate patterns that need to be generated and checked, which can significantly improve the efficiency of the mining process; 2) the span and gap constraints can help filter out patterns that do not make sense in the context of the data. For example, if we know that certain events should happen closely together in time, we can set a small span constraint to filter out patterns that have a large gap between them; 3) by applying the span and gap constraints, we can identify patterns that are meaningful and interesting, rather than just finding random combinations of items; 4) by limiting the number of items between two items in a pattern, we can improve the interpretability of the pattern and make it easier to understand the relationships between the items; 5) by limiting the number of items between two items in a pattern, we can reduce the noise in the data and focus on the most important items.

The span constraint bounds the minimum/maximum time span allowed for an occurrence of a sequential pattern. The span is the difference between the timestamp of the last item and that of the first item: in Figure 2 these are 6 and 1, and thus \(\langle login\_attempt,authorized \rangle\) has span 5 in that sequence. A span constraint requires that the pattern duration be longer or shorter than a given time period. By setting a span constraint, we can focus on identifying shorter or longer sequences of events based on our specific requirements. For instance, if we set a short span constraint, we may discover frequent patterns whose items occur closely together in a short period, while a larger span allows us to capture more spread-out occurrences. The maximal span constraint is anti-monotonic while the minimal span constraint is monotonic (Pei et al., 2002).

The gap constraint controls the minimum/maximum gap allowed between consecutive occurrences of items within a sequence. It specifies how many time units may intervene before the next item of the pattern is observed. In Figure 2, the gap between \(login\_attempt\) and \(authorized\) is 4. Gap constraints are essential for capturing temporal relationships between events. Setting appropriate gap values helps identify patterns in which related events remain significant despite delays or interruptions between them. The minimal and maximal gap constraints are anti-monotonic (Pei et al., 2002). A gap constraint applies to all embeddings: if an embedding does not satisfy the constraint, the pattern as a whole does not satisfy it.
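Span and gap can be computed directly from an embedding, as in the following Python sketch. It assumes that positions stand in for timestamps and that an embedding is given as a tuple of increasing 1-based positions; the function names and the keyword parameters of `satisfies` are illustrative choices, not part of MASS-CSP.

```python
def span(embedding):
    """Difference between the last and the first position of an embedding."""
    return embedding[-1] - embedding[0]


def gaps(embedding):
    """Number of positions skipped between consecutive matched items."""
    return [b - a - 1 for a, b in zip(embedding, embedding[1:])]


def satisfies(embedding, min_gap=0, max_gap=None, min_span=0, max_span=None):
    """Check one embedding against optional gap and span bounds."""
    if any(g < min_gap for g in gaps(embedding)):
        return False
    if max_gap is not None and any(g > max_gap for g in gaps(embedding)):
        return False
    if span(embedding) < min_span:
        return False
    if max_span is not None and span(embedding) > max_span:
        return False
    return True


# <a, b> in <a, b, c> has embedding (1, 2); in <a, c, c, b> it has (1, 4).
print(span((1, 2)), gaps((1, 2)))
print(span((1, 4)), gaps((1, 4)))
# Positions 1 and 6, as in the Figure 2 example.
print(span((1, 6)), gaps((1, 6)))
print(satisfies((1, 4), max_gap=1))
```

Note that `satisfies` checks a single embedding only; deciding whether a pattern meets a gap constraint additionally requires considering its embeddings in each sequence, as stated above.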

5 Our ASP-based approach

In this section, we describe the proposed ASP encoding (see Footnote 1) and discuss the rationale behind it. We illustrate how we have modelled the contrast sequential pattern mining (CSPM) task and the problem input/output. We assume that the database contains sequences of itemsets. For the sake of simplicity, however, we restrict patterns to sequences of items (each itemset is a singleton). Figure 3 illustrates the CSPM task in ASP. Given the definition of the problem as in Section 4 and the set of sequences, MASS-CSP combines the encoding of the CSPM problem with the problem instance expressed as facts. After the grounding step, i.e. the process of replacing the variables of a logic program with all possible ground terms (constants) to obtain a variable-free program, the solving process computes the answer sets. The solution of the CSPM problem is a set of contrast sequential patterns, one for each answer set. Each answer set contains the atoms describing the pattern and the class for which it is a contrast pattern. The solution relies on the “generate and test” principle: generate combinatorially all the possible patterns and their related occurrences in the database sequences, and test whether they satisfy the specified constraints. Since CSPM merges the two notions of sequential pattern and contrast pattern, it is necessary to first extract the frequent sequential patterns from the input sequences and then check which of these regularities are actually contrast sequential patterns.

A sequence database \(\mathcal {D}\) is represented as a collection of ASP facts of the kind |seq(s,p,i)| and |cl(s,c)|, where the |seq| predicate says that an item |i| occurs at position |p| in a sequence |s| while the |cl| predicate says that |s| is labelled with class |c|. Listing 2 shows the ASP encoding of the problem instance of Table 1.
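To make the fact representation concrete, the following Python sketch turns a small labelled database into |seq/3| and |cl/2| facts. The two sequences are hypothetical (they are not the content of Table 1), and the helper name `to_facts` and its input layout are assumptions made for illustration.

```python
def to_facts(db):
    """Build seq/3 and cl/2 facts from (class_label, items) pairs.

    Sequence identifiers and positions are numbered from 1.
    """
    facts = []
    for sid, (label, items) in enumerate(db, start=1):
        facts.append(f"cl({sid},{label}).")
        for pos, item in enumerate(items, start=1):
            facts.append(f"seq({sid},{pos},{item}).")
    return facts


# Hypothetical labelled sequences of singleton itemsets.
db = [
    ("normal", ["login_attempt", "authorized"]),
    ("attack", ["login_attempt", "auth_failed"]),
]

for fact in to_facts(db):
    print(fact)
```

Item and class names are emitted as lowercase constants, since strings beginning with an upper case letter would be read as variables in ASP.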

The sub-problem of mining frequent sequential patterns is encoded according to the principles outlined in Gebser et al. (2016), with the fill-gaps strategy encoding provided by Guyet et al. (2016) because of its efficiency compared to the skip-gaps strategy. First of all, it is important to decide whether a sequence \(S=\langle s_1,\,\dots ,\,s_m \rangle\) of the database supports a sequential pattern \(P =\langle p_1,\,\dots ,\,p_n \rangle\). This means that there exists an increasing mapping \(e =(e_i)_{1\le i\le n}\) such that \(p_i=s_{e_i}\). This mapping is an embedding of the pattern in the sequence, i.e. a relation from pattern item indexes to sequence item indexes. As mentioned above, we have followed the fill-gaps strategy to represent embeddings, as illustrated in Figure 4. The strategy expresses that once a pattern item has been mapped to the leftmost matching item of the sequence (the one with the lowest index), the knowledge of this mapping is maintained on the remaining sequence items. So, a fill-gaps embedding makes explicit only the leftmost admissible matches of the items of pattern P in sequence S.
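The fill-gaps idea can be sketched procedurally in Python. This is an illustration of the strategy as described above, not the ASP encoding itself: a pair `(l, p)` is recorded when the first `l` pattern items can be matched within the first `p` positions of the sequence, and the function names are assumptions introduced here.

```python
def fill_gaps_occ(pattern, sequence):
    """Return the pairs (l, p), both 1-based, that hold under fill-gaps.

    (l, p) holds if the l-th pattern item matches at position p right after
    the first l-1 items were matched, or if (l, p-1) already holds (the
    knowledge of a match is propagated to the following positions).
    """
    occ = set()
    for p, item in enumerate(sequence, start=1):
        for l, pat_item in enumerate(pattern, start=1):
            if item == pat_item and (l == 1 or (l - 1, p - 1) in occ):
                occ.add((l, p))
            if (l, p - 1) in occ:
                occ.add((l, p))
    return occ


def supports(pattern, sequence):
    """A sequence supports a pattern if the whole pattern is matched by its end."""
    return (len(pattern), len(sequence)) in fill_gaps_occ(pattern, sequence)


occ = fill_gaps_occ(["a", "b"], ["a", "c", "c", "b"])
print(sorted(occ))
print(supports(["a", "b"], ["a", "c", "c", "b"]))
print(supports(["b", "a"], ["a", "c", "c", "b"]))
```

Because matches are propagated to the right, support can be read off at the last position of the sequence alone, which is the practical appeal of the strategy.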

The full ASP encoding for CSPM is reported in Listings 3 and 4. It encompasses two phases: the first aims at the discovery of frequent sequential patterns, the second checks which of the discovered patterns are contrast patterns for some class. Listing 3 shows how to find sequential patterns. We start by collecting all the distinct items occurring in the input sequences (Line 1). Lines 3-7 generate all possible candidate patterns by combining these items. Candidate patterns are generated taking into account a minimum and a maximum length, called |minlen| and |maxlen| respectively. In Lines 9-11, the occurrences of a candidate pattern are computed by analysing all sequences. Lines 13-15 compute the support of a candidate pattern; if its support is less than the minimum support threshold |minsup|, it is not a sequential pattern.

Listing 4 shows the second phase of MASS-CSP. We have followed the definitions of Section 4.1 with a straightforward ASP implementation. In the code below, Lines 1-2 compute the cardinality of the datasets \(\mathcal {D}_1\) and \(\mathcal {D}_2\), whereas Lines 4-5 compute the support of a pattern s in \(\mathcal {D}_1\) and \(\mathcal {D}_2\) respectively. Lines 7-9 calculate the growth rate \(GR_{C_1}(s)\) in accordance with the formula in Section 4.1, while Line 7 captures the case of \(GR_{C_1}(s) = \infty\). ASP does not support the computation of formulas that return decimal values. For this reason, an external function has been developed which can be called from within ASP (with the |@| command followed by the function name). The result is no longer treated in ASP as a constant but rather as a string. Analogously, Lines 11-13 encode the computation of \(GR_{C_2}\) as defined in Section 4.1, and Line 11 concerns the infinite case for \(GR_{C_2}\). Finally, Lines 15-16 check whether the sequence s at hand is a contrast pattern for either \(C_1\) or \(C_2\), again by means of an external function because decimal numbers are compared. If the growth rate is less than |mincr|, the constant |no| is returned, |yes| otherwise. Line 15 sets the first term of the |csp| atom to |yes| in accordance with the formulas in Section 4.1. The denials at Lines 16-17 discard all answer sets that do not represent contrast patterns for either of the two classes.

Finally, each answer set comprises a single pattern of interest. More precisely, an answer set represents a (contrast) sequential pattern \(s = \langle s_i \rangle _{ minlen \le i \le maxlen }\), where \(minlen\) and \(maxlen\) are the minimum and maximum pattern length. Several program constants are defined: |minsup| and |mincr| define the minimum support and contrast rate thresholds, |minlen| and |maxlen| represent the minimal and maximal pattern length, while |c1| and |c2| are the two classes. For example, the atoms |pat(1,login_attempt)|, |pat(2,authorized)| describe the contrast sequential pattern \(\langle login\_attempt, authorized \rangle\) for the database in Table 1, where the first argument expresses the position of the item inside the pattern. Listing 5 shows two of the thirteen contrast sequential patterns found for the set of sequences in Table 1 and discussed in Section 4.1. The patterns are represented by the predicate |pat/2|. The predicate |csp(a,c)| says that the pattern is a contrast sequential pattern for the class |c|.

5.1 Condensed representations encodings

Here we discuss the rationale behind the ASP encoding to find closed and maximal patterns. A closed pattern is such that none of its frequent super-patterns has the same support. A maximal pattern is such that none of its super-patterns is frequent. It is therefore necessary to compare the supports of several distinct patterns. Since a solution pattern is encoded through an answer set, a simple solution would be to compare all answer sets together. However, such a facility is not provided by the basic ASP language. The main idea to find condensed representations is to add further constraints, as done in Guyet et al. (2016). These constraints are the following: a sequence S is maximal (resp. closed) if and only if for every sequence \(S'\) s.t. S is a subsequence of \(S'\) with \(|S'|=|S|+1\), \(S'\) is not frequent (resp. \(S'\) does not have the same support as S). The strategy is based on insertable items, i.e., a pattern S is maximal iff any sequence \(S^j_a\), obtained by adding to S any item a at any position j, is not frequent. Such an a is called an insertable item. In the same way, a pattern S is closed iff there is no frequent sequence \(S^j_a\), obtained by adding any item a at any position j in S, such that every sequence T that supports S also supports \(S^j_a\).
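The one-item-extension criterion can be checked directly on a small database with the following Python sketch. It is a brute-force illustration of the definitions, not the ASP encoding, and all function names are hypothetical. Because every extension \(S^j_a\) is a super-pattern of S, its support can never exceed that of S, so "not the same support" is tested as "strictly smaller support".

```python
def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order, gaps allowed."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of sequences of the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def one_item_extensions(pattern, alphabet):
    """All sequences obtained by inserting one item at one position."""
    for j in range(len(pattern) + 1):
        for a in alphabet:
            yield pattern[:j] + [a] + pattern[j:]


def is_maximal(pattern, database, minsup):
    """No one-item extension of the pattern is frequent."""
    alphabet = sorted({x for seq in database for x in seq})
    return all(support(ext, database) < minsup
               for ext in one_item_extensions(pattern, alphabet))


def is_closed(pattern, database):
    """No one-item extension has the same support as the pattern."""
    alphabet = sorted({x for seq in database for x in seq})
    sup = support(pattern, database)
    return all(support(ext, database) < sup
               for ext in one_item_extensions(pattern, alphabet))


# Toy database, invented for illustration.
toy_db = [list("abc"), list("abc"), list("ac"), list("b")]
```

On `toy_db` with a minimum support of 2, the pattern ⟨a, c⟩ is closed (support 3, while ⟨a, b, c⟩ has support 2) but not maximal, whereas ⟨a, b⟩ is not closed because ⟨a, b, c⟩ has the same support.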

We adopt the same encoding of Guyet et al. (2016) to find condensed representations of contrast sequential patterns. Listing 6 describes how to define the set of items that can be inserted between successive items of an embedding (Lines 1-8). These itemsets are encoded by the predicate |ins(t,x,i)| where |i| is an item which can be inserted in an embedding of the current pattern in sequence |t| between items at position |x| and |x+1| in the pattern. Here, only the positions of the last and the first valid occurrences are required for any pattern item. It can be observed that the strategy provides the first valid occurrence of an item |x| as the first atom of the |occ(t,x,_)| sequence. Then, computing the last occurrence for each pattern item can be done in a similar way by considering an embedding represented in reverse order. Lines 2 to 4 represent |occ/3| and |rocc/3| (reverse order) occurrences. The computation of insertable items (Lines 5-8) exploits the above remark. Line 5 defines the insertable region in a prefix using |rocc(t,1,p)|. Since items are insertable if they are strictly before the first position, we consider the value of |rocc(t,1,p+1)|. Line 6 uses |occ(t,l,p)| to identify the suffix region. Lines 7-8 combine both constraints for in-between cases.
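The role of the first and last valid occurrences can be sketched in Python as follows. This is an illustration of the idea rather than a transcription of Listing 6: the helper names are hypothetical and slots are numbered from 0 (prefix) to the pattern length (suffix), whereas the ASP encoding uses positions starting at 1. An item is insertable in slot x of a sequence when it occurs strictly after the earliest match of the pattern items before the slot and strictly before the latest match of the pattern items after the slot.

```python
def leftmost_embedding(pattern, seq):
    """Earliest positions matching each pattern item, or None if unsupported."""
    positions, start = [], 0
    for item in pattern:
        try:
            p = seq.index(item, start)
        except ValueError:
            return None
        positions.append(p)
        start = p + 1
    return positions


def rightmost_embedding(pattern, seq):
    """Latest positions matching each pattern item, found by scanning in reverse."""
    rev = leftmost_embedding(pattern[::-1], seq[::-1])
    if rev is None:
        return None
    n = len(seq)
    return [n - 1 - p for p in reversed(rev)]


def insertable_items(pattern, seq):
    """Map each slot (0 = prefix, len(pattern) = suffix) to its insertable items."""
    left = leftmost_embedding(pattern, seq)
    if left is None:
        return {}
    right = rightmost_embedding(pattern, seq)
    slots = {}
    for x in range(len(pattern) + 1):
        # Positions strictly after the earliest match of the item before the slot
        lo = left[x - 1] + 1 if x > 0 else 0
        # and strictly before the latest match of the item after the slot.
        hi = right[x] if x < len(pattern) else len(seq)
        slots[x] = set(seq[lo:hi])
    return slots
```

For the sequence ⟨a, c, b, d⟩ and the pattern ⟨c, b⟩, only a is insertable in the prefix, nothing is insertable between c and b, and only d is insertable in the suffix.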

Listings 7 and 8 are the ASP denials for dealing with closed and maximal patterns. The denial in Listing 7 concerns the extraction of closed patterns. It specifies that for each insertion position (from 1, in the prefix, to |maxlen+1|, in the suffix), it is not possible to have a frequent insertable item |i| for each supported sequence.

To extract only maximal patterns, the denial in Listing 8 denies patterns for which it is possible to insert an item which will be frequent within sequences that support the current pattern.

5.2 Gap and span encodings

In the following we briefly describe the choices made to implement gap and span constraints in the MASS-CSP framework. While in Guyet et al. (2016) the authors presented the span and gap constraint encodings for the skip-gap technique, we here consider the encoding for the fill-gap technique, that is the one used for finding sequential patterns and described in Section 4.3. So our implementation complements previous work of Guyet et al. (2016). The goal is to improve the efficiency of the overall ASP encoding of the CSPM problem. As discussed in Guyet et al. (2016), the gap and span constraints can be encoded in two ways in ASP. 1) We can use denials to delete answer sets that do not satisfy the constraints. The problem is that they act a posteriori, during the test stage, for validating candidate models. In this way we lose the pruning benefits of the constraints. 2) A more effective method consists in introducing constraints in the generate stage for pruning the search space earlier. This is possible if we implement the gap and span constraints as choice rules. We have chosen to follow this second implementation strategy for the fill-gaps technique, that is the representation of embeddings adopted in MASS-CSP.
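To make the meaning of the two constraints concrete, the Python sketch below tests whether a pattern has at least one embedding in a sequence that respects gap and span bounds. It only illustrates the semantics of the constraints and says nothing about how the ASP choice rules prune the search. The function name and parameters are hypothetical, and the sketch assumes that the gap is the difference between the positions of two consecutive matched items (adjacent items have gap 1) and that the span is the difference between the last and first matched positions; Section 4.3 may define these quantities with a different offset.

```python
def has_embedding(pattern, seq, mingap=1, maxgap=None, maxspan=None):
    """True if pattern occurs in seq within the given gap and span bounds.

    Assumptions: gap = difference between consecutive matched positions
    (mingap >= 1), span = last matched position minus first one.
    """
    def extend(k, prev, first):
        if k == len(pattern):
            return True
        if prev is None:
            candidates = range(len(seq))
        else:
            hi = len(seq) if maxgap is None else min(len(seq), prev + maxgap + 1)
            candidates = range(prev + mingap, hi)
        for p in candidates:
            if seq[p] != pattern[k]:
                continue
            start = p if first is None else first
            if maxspan is not None and p - start > maxspan:
                break  # later positions can only widen the span
            if extend(k + 1, p, start):
                return True
        return False

    return extend(0, None, None)
```

With the sequence ⟨a, x, x, b⟩, the pattern ⟨a, b⟩ is supported without constraints, rejected with a maximum gap of 1 or a maximum span of 2, and accepted again once either bound is raised to 3.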

Span and gap constraints can be jointly applied by merging the two encodings above as shown in Listing 11.

6 Evaluation

Having demonstrated that modelling in ASP is powerful yet simple, it is now interesting to examine the computational behaviour of ASP-based encodings. In pattern mining, it is usual to evaluate the effectiveness (number of extracted patterns) and the time and space efficiency of an algorithm. In ASP-based declarative pattern mining (DPM) approaches, it is additionally important to know the grounding and solving times. Grounding is a critical step in ASP because it transforms the first-order logical program into a propositional, variable-free program for which a solver can compute answer sets. Grounding can make the program explode in size if the program is not written carefully (e.g., too many variables or large domains). If solving is slow, it may be necessary to simplify rules or reduce the search space. Thus, the objective is to better understand the advantages and drawbacks of each encoding. The questions we would like to answer are the following: How does the grounding step behave in MASS-CSP? Do condensed representations really reduce computing resource needs in MASS-CSP? What is the behaviour of MASS-CSP when pattern constraints are added?

For the evaluation we have used the event logs made available by Arif et al. (2020) for the authentication failure and numb attack on 4G-LTE cellular network, and partitioned each event log into normal and attack sequences. So the two classes for the problem in hand are |normal| and |attack|. A comprehensive description of each log can be found in Arif et al. (2020) and in their GitHub repository.^{Footnote 2} Table 2 summarizes the main features of the datasets used in the experiments. We have chosen these datasets because (i) they are suitable for the task considered in this article (classified sequences), (ii) they have already been used in the DPM literature and (iii) they are publicly available.

Table 2 Datasets used. The number within each dataset name is the number of sequences, equally distributed between normal and attack. We report the dataset name, the number of distinct symbols (\(|\Sigma |\)), the number of sequences (\(|D|\)), the total number of symbols in the dataset (\(\Vert D\Vert\)), the maximum and the average sequence length (\(|T|\)), and the density, calculated as \(\frac{\Vert D\Vert }{|\Sigma | \cdot |D|}\)
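The features listed in this caption can be computed from a sequence database as in the following Python sketch. The function name and the dictionary keys are hypothetical, and the toy database in the usage note is invented; it does not correspond to any row of Table 2.

```python
def dataset_features(database):
    """Compute the dataset features described in the caption of Table 2."""
    symbols = {item for seq in database for item in seq}
    lengths = [len(seq) for seq in database]
    total = sum(lengths)
    return {
        "n_symbols": len(symbols),        # |Sigma|
        "n_sequences": len(database),     # |D|
        "n_items": total,                 # ||D||
        "max_len": max(lengths),          # maximum |T|
        "avg_len": total / len(database), # average |T|
        # density = ||D|| / (|Sigma| * |D|)
        "density": total / (len(symbols) * len(database)),
    }
```

For a toy database made of the two sequences ⟨a, b⟩ and ⟨b, a⟩, there are 2 symbols, 2 sequences and 4 items in total, so the density is 4 / (2 · 2) = 1.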


As the ASP system, we have used version 5.4.0 of Clingo, with default solving parameters. The timeout (TO) has been set to 3600 seconds (1 hour). For the whole evaluation phase, we report the time (in seconds) and memory (in megabytes) requirements. Time information (grounding and solving) is given by Clingo, while memory information is given by a script that captures the memory consumption of the Clingo process. The ASP programs were run on a laptop computer with Ubuntu 20.04.6, AMD Ryzen 5 3500U @ 2.10 GHz, 8GB RAM, without using the multi-threading mode of Clingo. Multi-threading reduces the mean runtime but introduces variance due to the random allocation of tasks. Such variance is inconvenient for interpreting results with repeated executions.

The whole section is organized as follows. In Section 6.1, results obtained with the basic version of MASS-CSP are reported. Next, in Section 6.2, we compare the grounding step and solving time-memory requirements of the basic MASS-CSP and condensed representations. In Section 6.3, grounding and solving results about pattern constraints are discussed. Section 6.4 describes the work done for a comparative analysis with a hybrid ASP approach, given that no contrast sequential pattern mining algorithm is publicly available.

6.1 Results with basic MASS-CSP

In the first batch of experiments we have tested the ability of the basic algorithm to extract patterns that characterize attack sequences. To this aim we have run it under different configurations to understand the number of extracted patterns and the time and memory requirements. We started by extracting patterns of minimum length 2 and maximum length 2 (|minlen| and |maxlen| parameters), then raised the maximum length from 3 up to 6, while varying the minimum support threshold (|minsup|) and the minimum contrast rate threshold (|mincr|). This work has been done for all versions of each dataset. In Lisi and Sterlicchio (2023), we showed how the encoding behaves in time and memory with different input sizes when increasing or decreasing |minsup| and/or |mincr|. In this paper, we go further and analyse how an increase in the length of the patterns affects the memory consumption and the time taken, considering the grounding and solving time. Figures 5 and 6 summarize these findings on the largest available datasets: when we look for longer patterns (from 2 to 6), the program gets bigger and thus more patterns are found, the time is much higher and memory usage grows. We obtain a huge number of patterns because the basic encoding does not take into account more effective and efficient constraints on embeddings, like span and/or gap, or different pattern representations, like the condensed ones.

6.2 Results for MASS-CSP with condensed representations

A significant challenge in pattern mining is the issue of pattern explosion, which refers to the excessive number of patterns generated by mining algorithms when using a minimum support threshold in database analysis. Condensed pattern representations have been suggested in the literature to tackle this problem. Figures 7, 8, 9 and 10 illustrate the time and memory demands of MASS-CSP and its variations with closed and maximal patterns for both datasets.

Across all datasets, mining condensed patterns is more time-consuming for CSPM tasks. The three approaches show comparable performance only when the dataset size is small. When comparing closed to maximal patterns, the closed variant is generally more time-intensive except in the cases of Auth_failure_1000 and Numb_attack_1000 where time-out events lead to identical time requirements.

In terms of memory requirements, the condensed representation demands significantly greater resources compared to the basic formulation of MASS-CSP. Additionally, there is a notable variance in memory usage for smaller datasets. It can be generally concluded that closed patterns necessitate more memory than their maximal counterparts.

Table 3 shows the average grounding times in seconds for all datasets and encoding schemes using MASS-CSP. As the dataset size increases, the grounding time also increases for all encoding schemes, indicating the computational cost of processing larger datasets. MASS-CSP\(^{+m}\) generally outperforms or is comparable to MASS-CSP\(^{+c}\) across both Auth_failure and Numb_attack datasets, especially as the dataset size increases. For Auth_failure_40, MASS-CSP is notably faster than the other encoding schemes, but the performance gap narrows as the dataset size increases. In the Numb_attack_40, MASS-CSP\(^{+c}\) is the fastest among all encoding schemes, while for larger sizes, the difference becomes less substantial. Overall, MASS-CSP\(^{+m}\) seems to maintain a balance between execution time and increase in dataset size.

Table 3 Average grounding time (seconds) for MASS-CSP with condensed representations (\(c=closed\) and \(m=maximal\)).


6.3 Results for MASS-CSP with span and gap constraints

In Section 4.3 we have described the type of constraints that can be added to the basic version of MASS-CSP, first to improve efficiency, but also to better search for patterns by filtering out useless ones according to the new constraints. For example, we would like patterns of behaviour that have a certain temporal duration or that fall within a certain minimum and/or maximum temporal range. In the specific case of the datasets, contrast sequential patterns that describe sequential events during an attack may be useless if there are gaps between one item and another within the sequence. Interesting patterns are those whose items are consecutive, without gaps between them, or those that capture the duration of bad sequential events that can lead to an attack. This way we can accurately capture the crucial steps that may lead to detecting an attack rather than stating that the evolution of the system is correct and is functioning normally.

To demonstrate the usefulness of the improvements described in Section 4.3, Figures 11, 12 and 13 make a comparison between the basic MASS-CSP (dotted lines) and the improved one with the span/gap constraints (continuous lines). The advantage of adding constraints on the gap when calculating the embeddings of a pattern is clear. First, we have control over the type of pattern we want with the minimum and maximum gap. The pattern set is considerably reduced, extracting only those actually useful for our purpose with an advantage on time and memory as we act directly in the pattern generation phase, having a smaller ground program than the previous one. Analogously, using the span constraint, we are able to reduce the number of patterns and the execution times, but it seems that we do not have improvement in memory consumption. This is because we are looking for patterns of a minimum and/or maximum duration and with a maximum length, but we have no control over the gap between items within the sequence. Obviously, the constraints on gap and span are successful when we want to find patterns because we reduce the search space and are sure to eliminate superfluous ones. When we apply jointly the two constraints (Gap+Span), the number of patterns is lower than with the gap and span constraint taken separately, but memory consumption and execution time are higher than with the gap constraint only. The gap constraint is the one that brings the best advantages in terms of overall performance.

Table 4 shows the average grounding times in seconds for all datasets and encoding schemes using MASS-CSP. For both Auth_failure and Numb_attack datasets, as the size of the dataset increases (e.g., from 40 to 1000), the grounding time consistently increases across all encoding methods. This indicates that larger datasets require more computational resources or time to ground, which is expected given the increased complexity and volume of data. The baseline encoding of MASS-CSP generally has the shortest grounding times across all datasets. Adding the gap constraint slightly increases the grounding time compared to the base MASS-CSP encoding, but the increase is relatively modest. The addition of the span constraint significantly increases grounding time compared to both MASS-CSP and MASS-CSP\(^{+g}\). This suggests that the span constraint introduces additional complexity that requires more processing time. Combining both gap and span constraints results in the longest grounding times across all datasets. In some cases, the grounding time nearly doubles or even triples compared to MASS-CSP\(^{+s}\) alone. This shows that combining these features adds considerable overhead to the grounding process.

Table 4 Average grounding time (seconds) for MASS-CSP and constraints (\(g = gap\) and \(s = span\))


6.4 Comparative analysis

MASS-CSP is an ASP-based declarative approach to the contrast sequential pattern mining (CSPM) task. It would be interesting to compare the peculiarities of this approach, including time and memory requirements, with other declarative approaches (e.g. based on CP or SAT) or with procedural algorithms. Unfortunately, the comparison is not possible for several reasons. (I) The works cited on CSPM (Zheng et al., 2016; Wu et al., 2022) do not provide a public implementation to compare with MASS-CSP. (II) The other cited works refer to the Declarative Pattern Mining (DPM) research stream (Guyet et al., 2014; Gebser et al., 2016; Samet et al., 2017; Guyet et al., 2016; Besnard & Guyet, 2020; Paramonov et al., 2019), and the whole focus is on Sequential Pattern Mining (SPM) with comparisons also with other approaches such as the CP-based ones. SPM is the first of the two computation steps in MASS-CSP so it did not seem appropriate to test only on the sequential part since the task is on CSPM. Furthermore, the ASP encoding to extract sequential patterns, based on the fill-gaps technique (the one we use), has been widely compared with other methodologies in the past (Guyet et al., 2016). (III) MASS-CSP, to the best of our knowledge, is the first declarative work on the CSPM task based on ASP and therefore we have no way to compare it with other works that use ASP.

We therefore opted for a hybrid approach to pattern mining as suggested by Paramonov et al. (2019). In the past, we had already used a similar technique to compare an ASP-based approach with a hybrid approach that involved the use of an SPM algorithm such as PrefixSpan with ASP-based post-processing and obtained encouraging results from the efficiency point of view (Lisi & Sterlicchio, 2023). Based on these results, we adopt the same strategy for this work, that is, first use PrefixSpan to extract the sequential patterns, and then the MASS-CSP module designated to decide which of these patterns are contrasting. The results, however, are not satisfactory: the sequential patterns extracted by PrefixSpan become the input for the ASP encoding, but a bottleneck appears in the grounding phase, with an explosion of memory consumption that prevents the subsequent solving phase from being run. Note that the same hardware was used as for the previous experiments. In our opinion, this happens due to the characteristics of the datasets, which lead to the extraction of a large number of sequential patterns that are subsequently used as data facts in input to MASS-CSP, with the above-mentioned consequences on the grounding performance. In fact, MASS-CSP extracts the contrast sequential patterns by considering a set of rules that allow for the pruning of many sequential patterns before being verified as contrast. Instead, with the hybrid approach, these rules are skipped because PrefixSpan is used as the first step to find the sequential patterns.

7 Conclusions

This article has presented MASS-CSP, a novel approach to the contrast sequential pattern mining (CSPM) task which is based on Answer Set Programming (ASP). The declarative nature of ASP allowed us to define the problem in an elegant way and with few rules. Furthermore, the presented constraints (gap and span) and condensed forms (closed and maximal) were encoded in an equally elegant way thanks to the modeling features of ASP. Users can easily express complex preferences and restrictions using ASP syntax, making the approach highly adaptable to different domains of science and specific analytical needs. Also, the approach moves towards a richer query language for pattern mining. Instead of being limited to pre-defined algorithms, users can essentially “query” the data for patterns that meet their specific criteria, opening up new possibilities for exploratory data analysis. Overall, a key benefit of declarative pattern mining is that it requires less development effort for well-defined tasks compared to procedural methods. Adding new constraints to our framework is easy, needing only a few lines of code, due to the flexibility of the ASP language and solvers.

Another goal of this paper was to convey to the reader that while encoding in ASP can be straightforward, creating efficient programs might be challenging. To develop competitive encodings, a thorough understanding of the solving process is necessary. We have proposed several potential improvements for the basic encoding of MASS-CSP. These encodings have been extensively tested on real-world datasets to assess the overall efficiency of this approach.

The novelty of this article lies in its strong adherence to the declarative paradigm, which is a distinguishing feature of ASP. Unlike imperative approaches that require the explicit specification of control flow and algorithmic steps, the declarative nature of ASP allows for the expression of complex problems through high-level logical specifications. This enables the user to focus on what the problem is, rather than how to solve it, thereby promoting clarity, modularity, and ease of maintenance. Thanks to the generate-and-test approach of ASP, MASS-CSP does an exhaustive search of all the possible exact sequential patterns present in the data set to possibly attribute them to a specific class based on threshold values for contrast rate. So, depending on the values considered, a very high threshold may find only the most likely patterns (though not in the probabilistic sense) but may cause others to be missed. Conversely, a low threshold could lead to noise and therefore an increase in false positives. Much depends on the threshold values of the measures considered (support and contrast rate).

Experiments with MASS-CSP have shown that applying a declarative approach to the CSPM task is feasible. Also, results have highlighted the benefits of adding constraints to embeddings. In particular, the span and gap constraints allow the set of patterns found to be pruned, thus decreasing the memory consumption and the execution time. The mining process has been made more effective since constraints better limit the search for desired behaviours. Moreover, since the constraints are implemented in the search phase for sequential patterns, the advantage is brought not only to CSPM tasks but also to any other pattern mining task that is based on sequence mining. Finally, since the only input is the set of labelled execution sequences, the approach applies also to other attacks in the same domain or even on other networks such as 5G, as well as to different contexts and application domains.

To the best of our knowledge, MASS-CSP is the first declarative approach to the CSPM task. There are other traditional approaches in the literature that address the same task, but we could not make an empirical comparison due to the lack of publicly available implementations. So, analogously to what we did in a previous MASS-CSP experimentation (Lisi & Sterlicchio, 2023), we have tried to perform a comparative evaluation against a hybrid approach inspired by Paramonov et al. (2019). However, these experiments have produced an out-of-memory result due to the large amount of resources required by the grounding phase for the datasets considered in this paper. In future work, we plan to replicate the experiments with more powerful hardware. Since MASS-CSP can only mine exact patterns, another direction of investigation could be to deal with uncertainty in the pattern computation by exploiting probabilistic ASP as a framework.

Notes

References

Arif, M.F., Larraz, D., Echeverria, M., Reynolds, A., Chowdhury, O. & Tinelli, C. (2020). Syslite: Syntax-guided synthesis of PLTL formulas from finite traces. In: 2020 Formal Methods in Computer Aided Design (FMCAD), (pp. 93–103). https://doi.org/10.34727/2020/isbn.978-3-85448-042-6_16.
Besnard, P. & Guyet, T. (2020). Declarative mining of negative sequential patterns. In: DPSW 2020-1st Declarative Problem Solving Workshop, (pp. 1–8).
Brewka, G., Eiter, T., & Truszczynski, M. (2011). Answer set programming at a glance. Communications of the ACM, 54(12), 92–103. https://doi.org/10.1145/2043174.2043195
Chen, Y., Gan, W., Wu, Y. & Yu, P.S. (2022). Contrast pattern mining: A survey. CoRR abs/2209.13556. https://doi.org/10.48550/ARXIV.2209.13556.
Coquery, E., Jabbour, S., Saïs, L. & Salhi, Y. (2012). A sat-based approach for discovering frequent, closed and maximal patterns in a sequence. In: Raedt, L.D., Bessiere, C., Dubois, D., Doherty, P., Frasconi, P., Heintz, F., Lucas, P.J.F. (eds.) ECAI 2012 - 20th European Conference on Artificial Intelligence. Including Prestigious Applications of Artificial Intelligence (PAIS-2012) System Demonstrations Track, Montpellier, France, August 27-31, 2012. Frontiers in Artificial Intelligence and Applications, (Vol. 242, pp. 258–263). IOS Press, Amsterdam. https://doi.org/10.3233/978-1-61499-098-7-258.
Dong, G., & Bailey, J. (2013). Contrast Data Mining: Concepts, Algorithms, and Applications. CRC Press.
Gebser, M., Guyet, T., Quiniou, R., Romero, J. & Schaub, T. (2016). Knowledge-based sequence mining with ASP. In: Kambhampati, S. (ed.) Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York 2016, (pp. 1497–1504). IJCAI/AAAI Press, USA. http://www.ijcai.org/Abstract/16/215.
Gebser, M., Kaminski, R., Kaufmann, B. & Schaub, T. (2014). Clingo = ASP + control: Preliminary report. CoRR abs/1405.3694. arXiv:1405.3694.
Gelfond, M., & Lifschitz, V. (1991). Classical negation in logic programs and disjunctive databases. New Gener. Comput., 9(3/4), 365–386. https://doi.org/10.1007/BF03037169
Guns, T., Dries, A., Nijssen, S., Tack, G., & Raedt, L. D. (2017). Miningzinc: A declarative framework for constraint-based mining. Artificial Intelligence, 244, 6–29. https://doi.org/10.1016/J.ARTINT.2015.09.007
Guyet, T. (2020). Enhancing sequential pattern mining with time and reasoning. Habilitation à diriger des recherches, Université de Rennes 1 (2020). https://theses.hal.science/tel-02495270.
Guyet, T., Happe, A. & Dauxais, Y. (2017). Declarative sequential pattern mining of care pathways. In: Teije, A., Popow, C., Holmes, J.H., Sacchi, L. (eds.) Artificial Intelligence in Medicine - 16th Conference on Artificial Intelligence in Medicine, AIME 2017, Vienna, Austria, June 21-24 2017. Proceedings. Lecture Notes in Computer Science, (Vol. 10259, pp. 261–266). Springer, Cham. https://doi.org/10.1007/978-3-319-59758-4_29
Guyet, T., Moinard, Y. & Quiniou, R. (2014). Using answer set programming for pattern mining. CoRR abs/1409.7777. arXiv:1409.7777.
Guyet, T., Moinard, Y., Quiniou, R. & Schaub, T. (2016). Efficiency analysis of ASP encodings for sequential pattern mining tasks. In: Pinaud, B., Guillet, F., Crémilleux, B., Runz, C. (eds.) Advances in Knowledge Discovery and Management - Volume 7 [Best of EGC 2016, Reims, France]. Studies in Computational Intelligence, (Vol. 732, pp. 41–81). Springer, Cham. https://doi.org/10.1007/978-3-319-65406-5_3.
Han, J., Pei, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U. & Hsu, M. (2001). Prefixspan: Mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of the 17th International Conference on Data Engineering, (pp. 215–224). IEEE.
Jabbour, S., Sais, L. & Salhi, Y. (2015). Decomposition based SAT encodings for itemset mining problems. In: Cao, T.H., Lim, E., Zhou, Z., Ho, T.B., Cheung, D.W., Motoda, H. (eds.) Advances in Knowledge Discovery and Data Mining - 19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II. Lecture Notes in Computer Science, (Vol. 9078, pp. 662–674). Springer, Cham. https://doi.org/10.1007/978-3-319-18032-8_52.
Järvisalo, M. (2011). Itemset mining as a challenge application for answer set enumeration. In: Delgrande, J.P., Faber, W. (eds.) Logic Programming and Nonmonotonic Reasoning - 11th International Conference, LPNMR 2011, Vancouver, Canada, May 16-19, 2011. Proceedings. Lecture Notes in Computer Science, (Vol. 6645, pp. 304–310). Springer, Cham. https://doi.org/10.1007/978-3-642-20895-9_35.
Leone, N., Allocca, C., Alviano, M., Calimeri, F., Civili, C., Costabile, R., Fiorentino, A., Fuscà, D., Germano, S., Laboccetta, G., Cuteri, B., Manna, M., Perri, S., Reale, K., Ricca, F., Veltri, P. & Zangari, J. (2019). Enhancing DLV for large-scale reasoning. In: Balduccini, M., Lierler, Y., Woltran, S. (eds.) Logic Programming and Nonmonotonic Reasoning - 15th International Conference, LPNMR 2019, Philadelphia, PA, USA, June 3-7, 2019, Proceedings. Lecture Notes in Computer Science, (Vol. 11481, pp. 312–325). Springer, Cham. https://doi.org/10.1007/978-3-030-20528-7_23
Lifschitz, V. (2016). Answer sets and the language of answer set programming. AI Mag., 37(3), 7–12. https://doi.org/10.1609/AIMAG.V37I3.2670
Lisi, F.A. & Sterlicchio, G. (2022). A declarative approach to contrast pattern mining. In: Dovier, A., Montanari, A., Orlandini, A. (eds.) AIxIA 2022 - Advances in Artificial Intelligence - XXIst International Conference of the Italian Association for Artificial Intelligence, AIxIA 2022, Udine, Italy, November 28 - December 2, 2022, Proceedings. Lecture Notes in Computer Science, (Vol. 13796, pp. 17–30). Springer, Cham. https://doi.org/10.1007/978-3-031-27181-6_2.
Lisi, F.A. & Sterlicchio, G. (2022). Declarative pattern mining in digital forensics: Preliminary results. In: Calegari, R., Ciatto, G., Omicini, A. (eds.) Proceedings of the 37th Italian Conference on Computational Logic, Bologna, Italy, June 29 - July 1, 2022. CEUR Workshop Proceedings, (Vol. 3204, pp. 232–246). CEUR-WS.org, Germany. https://ceur-ws.org/Vol-3204/paper_23.pdf.
Lisi, F.A. & Sterlicchio, G. (2022). Mining sequences in phone recordings with Answer Set Programming. In: Bruno, P., Calimeri, F., Cauteruccio, F., Maratea, M., Terracina, G., Vallati, M. (eds.) Joint Proceedings of the 1st International Workshop on HYbrid Models for Coupling Deductive and Inductive ReAsoning (HYDRA 2022) and the 29th RCRA Workshop on Experimental Evaluation of Algorithms for Solving Problems with Combinatorial Explosion (RCRA 2022) Co-located with the 16th International Conference on Logic Programming and Non-monotonic Reasoning (LPNMR 2022), Genova Nervi, Italy, September 5, 2022. CEUR Workshop Proceedings, (Vol. 3281, pp. 34–50). CEUR-WS.org. http://ceur-ws.org/Vol-3281/paper4.pdf.
Lisi, F.A. & Sterlicchio, G. (2023). Mining contrast sequential patterns with ASP. In: Basili, R., Lembo, D., Limongelli, C., Orlandini, A. (eds.) AIxIA 2023 - Advances in Artificial Intelligence - XXIInd International Conference of the Italian Association for Artificial Intelligence, AIxIA 2023, Rome, Italy, November 6-9, 2023, Proceedings. Lecture Notes in Computer Science, (Vol. 14318, pp. 44–57). Springer, Cham. https://doi.org/10.1007/978-3-031-47546-7_4.
Métivier, J., Loudni, S. & Charnois, T. (2013). A constraint programming approach for mining sequential patterns in a sequence database. CoRR abs/1311.6907. arXiv:1311.6907.
Mooney, C., & Roddick, J. F. (2013). Sequential pattern mining - Approaches and algorithms. ACM Computing Surveys, 45(2), 19:1–19:39. https://doi.org/10.1145/2431211.2431218
Négrevergne, B. & Guns, T. (2015). Constraint-based sequence mining using constraint programming. In: Michel, L. (ed.) Integration of AI and OR Techniques in Constraint Programming - 12th International Conference, CPAIOR 2015, Barcelona, Spain, May 18-22, 2015, Proceedings. Lecture Notes in Computer Science, (Vol. 9075, pp. 288–305). Springer, Cham. https://doi.org/10.1007/978-3-319-18008-3_20.
Paramonov, S., Stepanova, D., & Miettinen, P. (2019). Hybrid ASP-based approach to pattern mining. Theory Pract. Log. Program., 19(4), 505–535. https://doi.org/10.1017/S1471068418000467
Pei, J., Han, J. & Wang, W. (2002). Mining sequential patterns with constraints in large databases. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management. CIKM ’02, pp. 18–25. Association for Computing Machinery, New York. https://doi.org/10.1145/584792.584799.
Pei, J., Han, J., & Wang, W. (2007). Constraint-based sequential pattern mining: The pattern-growth methods. Journal of Intelligent Information Systems, 28(2), 133–160. https://doi.org/10.1007/S10844-006-0006-Z
Samet, A., Guyet, T. & Négrevergne, B. (2017). Mining rare sequential patterns with ASP. In: Lachiche, N., Vrain, C. (eds.) Late Breaking Papers of the 27th International Conference on Inductive Logic Programming, Orléans, France, September 4-6, 2017. CEUR Workshop Proceedings, (Vol. 2085, pp. 51–60). CEUR-WS.org, Germany. https://ceur-ws.org/Vol-2085/sametLBP-ILP2017.pdf.
Sterlicchio, G. & Lisi, F.A. (2024). Condensed representations for contrast sequential pattern mining in ASP. In: Angelis, E.D., Proietti, M. (eds.) Proceedings of the 39th Italian Conference on Computational Logic, Rome, Italy, June 26-28, 2024. CEUR Workshop Proceedings, Vol. 3733. CEUR-WS.org, Germany. https://ceur-ws.org/Vol-3733/short1.pdf.
Sterlicchio, G. & Lisi, F.A. (2024). Detecting patterns of attacks to network security in urban air mobility with answer set programming. In: Endriss, U., Melo, F.S., Bach, K., Diz, A.J.B., Alonso-Moral, J.M., Barro, S., Heintz, F. (eds.) ECAI 2024 - 27th European Conference on Artificial Intelligence, 19-24 October 2024, Santiago de Compostela, Spain - Including 13th Conference on Prestigious Applications of Intelligent Systems (PAIS 2024). Frontiers in Artificial Intelligence and Applications, (Vol. 392, pp. 1285–1292). IOS Press, Amsterdam. https://doi.org/10.3233/FAIA240626.
Wu, Y., Wang, Y., Li, Y., Zhu, X., & Wu, X. (2022). Top-k self-adaptive contrast sequential pattern mining. IEEE Transactions on Cybernetics, 52(11), 11819–11833. https://doi.org/10.1109/TCYB.2021.3082114
Zheng, Z., Wei, W., Liu, C., Cao, W., Cao, L., & Bhatia, M. (2016). An effective contrast sequential pattern mining approach to taxpayer behavior analysis. World Wide Web, 19(4), 633–651. https://doi.org/10.1007/S11280-015-0350-4

Acknowledgements

This work was partially supported by the project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU.

Funding

Open access funding provided by Politecnico di Bari within the CRUI-CARE Agreement.

Author information

Authors and Affiliations

DMMM, Polytechnic University of Bari, Bari, Italy
Gioacchino Sterlicchio
DIB and CILA, University of Bari Aldo Moro, Bari, Italy
Francesca Alessandra Lisi

Corresponding authors

Correspondence to Gioacchino Sterlicchio or Francesca Alessandra Lisi.

Additional information

Editors: Riccardo Guidotti, Anna Monreale, Dino Pedreschi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 116 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Sterlicchio, G., Lisi, F.A. MASS-CSP: mining with answer set solving for contrast sequential pattern mining. Mach Learn 114, 235 (2025). https://doi.org/10.1007/s10994-025-06876-0

Received: 02 April 2025
Revised: 02 July 2025
Accepted: 25 August 2025
Published: 28 September 2025
Version of record: 28 September 2025
DOI: https://doi.org/10.1007/s10994-025-06876-0
