A high-performance Rust implementation of formal language theory constructs, focusing on Context-Free Grammars (CFGs) and their intersection with regular languages. This library is inspired by and partly a port of pyformlang, with somewhat similar APIs and heavily optimized CFG/DFA intersection operations.
- CFG-DFA Intersection: Efficient computation of the intersection between context-free and regular languages
- Grammar Normalization: Conversion to Chomsky Normal Form and other canonical forms
- Grammar Reduction: Removal of useless symbols, epsilon productions, and unit productions and further heuristics for language size reduction.
- Deterministic Finite Automata (DFA), Non-deterministic Finite Automata (NFA) and epsilon-NFAs
- Automata Operations: Union, intersection, complement, minimization, and determinization
- Bytes DFA: Specialized DFA implementation where the alphabet is the range of bytes (0-255)
- Regex to DFA Conversion: Direct compilation of regular expressions to DFAs
- Pattern Matching: Efficient string matching using compiled automata
- Lexical Analysis: Sophisticated tokenization with configurable lexing maps
- Grammar Substitution: Replace terminals with other grammars for compositional language construction
- Example Word Generation: Generate example strings from grammars and automata
- Performance Optimizations: Parallel processing, caching, and optimized data structures
Copy this directory into your project.
Add this to your Cargo.toml:
[dependencies]
rustformlang = { path = "./rustformlang" }use rustformlang::cfg::{cfg::CFG, variable::Variable};
let grammar_text = r#"
S -> A B | C
A -> a
B -> b
C -> c
"#;
let cfg = CFG::from_text(grammar_text, Variable::new("S"));use rustformlang::fa::{dfa::DFA, state::State};
use rustformlang::input_symbol::InputSymbol;
let mut dfa = DFA::empty();
let start = State::new("q0");
let accept = State::new("q1");
dfa.set_start_state(start.clone());
dfa.add_accept_state(accept.clone());
dfa.add_transition(&start, &InputSymbol::new("a"), &accept);
// Check if DFA accepts a string
let input = vec![InputSymbol::new("a")];
println!("Accepts 'a': {}", dfa.accepts(&input));use rustformlang::regex::regex::regex_to_dfa;
// Create minimal DFA from regex pattern
let dfa = regex_to_dfa("a(b|c)*d");
// Check if string matches pattern
let input = "abcbcd";
println!("Matches pattern: {}", dfa.accepts_string(input));Note that this creates a ByteDFA, i.e., a DFA specialized for the alphabet of 0-255.
use rustformlang::regex::regex::regex_to_dfa;
let cfg_text = r#"
S -> A B
A -> a
B -> b
"#;
let cfg = CFG::from_text(cfg_text, Variable::new("S"));
// Create DFA from regex
let dfa = regex_to_dfa("(ab)*").to_deterministic();
// Compute intersection and check emptiness (fast!)
let intersection_empty = cfg.is_intersection_empty(&dfa, None);
println!("Intersection is empty: {}", intersection_empty);
// Compute intersection and check emptiness (expensive!)
let intersection = cfg.intersection(&dfa);
println!("Intersection is empty: {}", intersection.is_empty());use rustformlang::constraining::{LexMap, StringOrDFA, compile_lex_map, lex, Token};
use rustformlang::input_symbol::InputSymbol;
use std::collections::HashMap;
// Define lexing rules
let mut lex_map = LexMap::new();
lex_map.insert(InputSymbol::new("NUMBER"), StringOrDFA::String(r"\d+".to_string()));
lex_map.insert(InputSymbol::new("WORD"), StringOrDFA::String(r"\w+".to_string()));
let compiled_map = compile_lex_map(lex_map, &HashMap::new());
// Tokenize input
let token = Token::Word("hello123".to_string());
let lexings = lex(&token, &compiled_map, true, &None);// Clean grammar (remove useless symbols, epsilon productions, unit productions)
let clean_cfg = cfg.cleaned();
// Convert to C2F+ε (automatically applies cleaning)
let normal_cfg = cfg.to_normal_form();
// Check properties
println!("Generates epsilon: {}", cfg.generates_epsilon());
println!("Is in normal form: {}", cfg.is_normal_form());
println!("Is empty: {}", cfg.is_empty());use std::collections::HashMap;
// Substitute terminals with other grammars
let substitution_grammar = CFG::from_text("S -> x y", Variable::new("S"));
let substitutions = HashMap::from([(Terminal::new("a"), &substitution_grammar)]);
let substituted_cfg = cfg.substitute(&substitutions);
// Concatenate grammars
let cfg2 = CFG::from_text("S -> d", Variable::new("S"));
let concatenated = CFG::concat(vec![&cfg, &cfg2]);// Minimize DFA
let minimized = dfa.minimize();
// Complement
let complement = dfa.complement();
// Intersection of two DFAs
let intersection = dfa1.intersection(&dfa2);
// Check emptiness
println!("Is empty: {}", dfa.is_empty());
// Generate example word
if let Some(word) = dfa.example_word() {
println!("Example word: {:?}", word);
}// Prefix languages
let prefix_lang = dfa.prefix_language();
let true_prefix_lang = dfa.true_prefix_language();
// Accept prefix of input
let input = "hello world";
if let Some(prefix_len) = dfa.accept_prefix_string(input) {
println!("Accepted prefix length: {}", prefix_len);
}- Parallel Processing: Lexical analysis uses Rayon for parallel computation
- Caching: Intelligent caching of lexing results and grammar computations
- Optimized Data Structures: Uses
hashbrown::HashMapandsmallvec::SmallVecfor performance - Memory Efficient: Careful memory management with reference counting and lazy evaluation
use rustformlang::constraining::{prelex_word, all_prelex_words};
// Apply preprocessing to words before lexing
let prelexed = prelex_word("hello", "[]", true, false);
let all_lexings = all_prelex_words("hello", Some("[]"), &compiled_map, Some(" \t\n"));// Get generating and nullable symbols
let generating = cfg.get_generating_symbols();
let nullable = cfg.get_nullable_symbols();
let reachable = cfg.get_reachable_symbols();
// Get production rules
let productions = cfg.get_productions();
for production in productions {
println!("{}", production);
}use std::time::Duration;
// Check intersection emptiness with timeout
let timeout = Duration::from_secs(10);
let is_empty = cfg.is_intersection_empty(&dfa, Some(timeout));
// Generate example word with timeout
let example = cfg.example_word(&dfa, Some(timeout));The library includes a JavaScript grammar benchmark that demonstrates performance on real-world grammars:
cargo run --bin js_grammar_bench --releaseRun the test suite:
cargo testThe library is organized into several key modules:
cfg: Context-free grammar implementationfa: Finite automata (DFA, NFA, epsilon-NFA)regex: Regular expression to automata conversionconstraining: Lexical analysis and tokenizationinput_symbol: Symbol representation and utilitieslanguage: Language trait definitions
Contributions are welcome! Please ensure all tests pass and follow the existing code style.
This project is licensed under the MIT License - see the LICENSE file for details.
This library is inspired by and compatible with pyformlang.