MALEFA: Multi-grAnularity Learning and Effective False Alarm Suppression for Zero-Shot Keyword Spotting

Abstract

User-defined keyword spotting (KWS) without domain-specific pre-labeled training data is fundamental to building adaptable and personalized voice interfaces. However, such systems still face significant challenges, including constrained computational resources and limited annotated training data. Existing methods also struggle to distinguish acoustically similar keywords, often leading to a high false alarm rate (FAR) in real-world deployments. To mitigate these limitations, we propose MALEFA, a novel lightweight zero-shot KWS framework that jointly learns utterance- and phoneme-level alignments via cross-attention and a multi-granularity contrastive learning objective. Evaluations on four public benchmark datasets show that MALEFA achieves a high accuracy of 90% while reducing FAR to 0.007% on the AMI dataset. Beyond its strong performance, MALEFA is computationally efficient and can readily support real-time deployment on resource-constrained devices.

Index Terms— zero-shot keyword spotting, contrastive learning, false alarm.

1 Introduction

Keyword spotting (KWS) enables intuitive human-computer interaction, allowing users to activate voice assistants or smart devices with spoken commands, especially in hands-busy situations such as driving or gaming. Conventional KWS systems typically operate under a closed-set paradigm (using predefined wake words such as “Hey Siri” or “OK Google”) and rely on extensive keyword-specific training data [1, 2, 3]. While effective in controlled conditions, their fixed-vocabulary setting limits adaptability to user-defined or previously unseen keywords, hindering personalization in realistic use cases. To overcome these restrictions, zero-shot KWS (ZSKWS) has emerged as a promising alternative [4, 5, 6], enabling the detection of arbitrary spoken keywords in given speech segments based solely on matching with their textual representations. This paradigm eliminates the need for keyword-specific audio data for training or fine-tuning, making it better suited to practical deployment.

Fig. 1: An example of a false alarm triggered by phonetic similarity between keywords.

One prominent ZSKWS approach is the cross-modal framework (CMCD) [7], which aligns audio utterances and textual queries in a shared embedding space, allowing for flexible keyword matching and removing the reliance on keyword-specific audio examples. Building on this, several studies have proposed methods to enhance alignment accuracy and generalization [8, 9, 10]. Despite these advances, existing systems rely on coarse-grained global representations of spoken utterances, which limits their ability to resolve phonetically similar keyword pairs. As illustrated in Figure 1, semantically distinct keywords often share overlapping phonetic content (highlighted in orange), leading to frequent false alarms (FA). Meanwhile, MM-KWS [11] and CED [12] utilize Conformer [13] to improve cross-lingual feature robustness. However, the substantial computational demands of these large pre-trained audio encoders hinder real-time deployment on resource-constrained devices. Therefore, there is a need for lightweight, robust alternatives that can effectively address phonetically ambiguous keywords while maintaining efficiency in low-resource scenarios.

In light of this, we propose MALEFA (implementation code: https://github.com/Debbyyy10158/MALEFA), a multi-granularity contrastive learning framework for ZSKWS. The proposed approach jointly learns utterance- and phoneme-level alignments, integrating cross-modal and contrastive learning techniques [14, 15, 16, 17] for improved ZSKWS. MALEFA is specifically designed to reduce FA while maintaining high detection accuracy under computational constraints.

Our main contributions are three-fold:

• Multi-granularity contrastive learning: A unified framework combining utterance- and phoneme-level contrastive objectives to capture both the global semantics of keywords and their fine-grained pronunciation.
• False alarm-aware loss: We propose a tailored loss that directly penalizes false positives (FP) through a sigmoid-based precision constraint, explicitly optimizing for a low false alarm rate (FAR) on ZSKWS tasks.
• Lightweight on-device deployment: Our model achieves a FAR of 0.007% and an accuracy of 90% on public benchmarks, with just 650K parameters and 93M FLOPs.

Table 1: Comparison of MALEFA with prior ZSKWS models and ablation variants. Metrics include AUC (% $\uparrow$), EER (% $\downarrow$), and ACC₄ (% $\uparrow$) on Google Speech Commands (G), Qualcomm (Q), LibriPhrase Easy (LP_E), and LibriPhrase Hard (LP_H). The full MALEFA achieves the best overall performance with only 0.7M parameters, while removing PCL, UCL, or FA-aware loss degrades accuracy (ACC) or increases equal error rates (EER), confirming their complementary contributions.

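Method	AUC G (%) $\uparrow$	AUC Q (%) $\uparrow$	AUC LP_E (%) $\uparrow$	AUC LP_H (%) $\uparrow$	EER G (%) $\downarrow$	EER Q (%) $\downarrow$	EER LP_E (%) $\downarrow$	EER LP_H (%) $\downarrow$	ACC₄ Q (%) $\uparrow$	# Params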
CMCD [7]	81.06	94.51	96.70	73.58	27.25	12.15	8.42	32.90	–	–
PhonMatchNet [8]*	98.11	98.90	99.29	88.52	6.77	4.75	2.80	18.82	80.45	0.7M
CED [12]	–	–	99.84	92.70	–	–	1.70	14.70	–	4.6M
CLAD [15]	–	–	97.03	76.15	–	–	8.65	30.30	–	2.2M
ADML [18]	–	–	99.86	88.71	–	–	1.33	20.09	–	1.8M
Ours	99.13	99.81	99.98	93.58	3.88	1.92	1.14	13.91	98.77	0.7M
w/o PCL	99.41	99.91	99.42	87.64	3.82	1.22	2.63	20.29	91.80	0.7M
w/o UCL	98.72	99.75	99.61	88.06	4.78	2.29	2.13	19.90	98.76	0.7M
w/o FA	94.83	97.57	96.07	86.47	9.85	8.16	8.62	21.10	84.19	0.7M

Table 2: False alarm rate (FAR, % $\downarrow$) comparison on AMI, Google Speech Commands (G), and Qualcomm (Q). Compared with PhonMatchNet, MALEFA reduces FAR by several orders of magnitude, achieving near-zero false alarms across all datasets.

Method	FAR AMI (%) $\downarrow$	FAR G (%) $\downarrow$	FAR Q (%) $\downarrow$
PhonMatchNet [8]*	17.879	7.438	5.743
Ours	0.007	0.002	0.000
w/o PCL	0.085	0.019	0.105
w/o UCL	1.334	3.580	0.029
w/o FA	14.542	6.710	0.690

2 Methodology

2.1 Feature Extractor

As schematically depicted in Fig. 2, MALEFA employs a two-stream encoder with separate audio and text encoders. Both audio and text modalities are processed independently and later aligned in the pattern extractor.

Audio encoder. Each utterance is passed through a pre-trained speech encoder [19] using a 775 ms window with 80 ms shift, producing 96-dimensional features. In parallel, the raw waveform is converted into a log-mel spectrogram (25 ms frame, 10 ms hop), subsequently projected by a lightweight trainable convolution layer. The two feature streams are concatenated to form the audio embedding $\mathbf{E}_{a}\in\mathbb{R}^{T_{a}\times 128}$ of the utterance, where $T_{a}$ is the number of frames. The experimental setup is identical to [8].

Text encoder. Keywords are first converted into phoneme sequences via a G2P converter [20], and each phoneme is embedded by a fully connected layer with ReLU activation, yielding $\mathbf{E}_{t}\in\mathbb{R}^{T_{t}\times 128}$ , where $T_{t}$ is the sequence length. Both $\mathbf{E}_{a}$ and $\mathbf{E}_{t}$ are augmented with sinusoidal positional encodings to capture temporal order and improve alignment robustness.
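The positional encoding step can be illustrated with a minimal sketch. The text only states that the encodings are sinusoidal, so the standard Transformer formulation, the base of 10000, and the helper names `sinusoidal_positions` and `add_positions` are assumptions made here for illustration, not details confirmed by the paper.

```python
import math

def sinusoidal_positions(length, dim=128, base=10000.0):
    """Return a length x dim table of sinusoidal positional encodings.

    Assumption: standard Transformer formulation; the base and the
    sin/cos interleaving are illustrative choices.
    """
    table = []
    for pos in range(length):
        row = []
        for i in range(dim):
            # Channels 2k and 2k+1 share one frequency: base ** (2k / dim).
            angle = pos / (base ** ((i - i % 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embedding):
    """Add positional encodings element-wise to a T x D embedding (list of rows)."""
    pe = sinusoidal_positions(len(embedding), len(embedding[0]))
    return [[x + p for x, p in zip(row, pe_row)]
            for row, pe_row in zip(embedding, pe)]
```

The same function would apply to both $\mathbf{E}_{a}$ (with $T_{a}$ rows) and $\mathbf{E}_{t}$ (with $T_{t}$ rows), since both are 128-dimensional.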

2.2 Pattern Extractor

As illustrated in Fig. 2, the pattern extractor employs cross-attention to align audio and text embeddings. The text embedding $\mathbf{E}_{t}$ serves as the query ( $Q$ ), while the audio embedding $\mathbf{E}_{a}$ provides both keys and values ( $K,V$ ). This allows each phoneme to attend to the most relevant audio frames, yielding a joint representation:

$$\mathbf{E}_{\text{joint}}=\text{CrossAttention}(Q=\mathbf{E}_{t},K=\mathbf{E}_{a},V=\mathbf{E}_{a}), \tag{1}$$

where $\mathbf{E}_{a}\in\mathbb{R}^{T_{a}\times 128}$ and $\mathbf{E}_{t}\in\mathbb{R}^{T_{t}\times 128}$ .
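As a rough illustration of Eq. (1), the sketch below implements single-head scaled dot-product cross-attention in plain Python. It assumes no learned query/key/value projections and a single head, neither of which is specified in the text; the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(e_t, e_a):
    """Single-head scaled dot-product cross-attention.

    e_t: T_t x D text (phoneme) embeddings, used as queries.
    e_a: T_a x D audio embeddings, used as keys and values.
    Returns (joint, weights): joint is T_t x D, weights is T_t x T_a.
    Assumption: learned projections and multiple heads are omitted.
    """
    d = len(e_t[0])
    joint, weights = [], []
    for q in e_t:
        # One score per audio frame for this phoneme query.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in e_a]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of audio frames gives the joint vector for this phoneme.
        joint.append([sum(wj * v[c] for wj, v in zip(w, e_a))
                      for c in range(d)])
    return joint, weights
```

Because the text embedding supplies the queries, the output has one row per phoneme ($T_{t}$ rows) regardless of the number of audio frames.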

2.3 Pattern Discriminator

The joint embedding $\mathbf{E}_{\text{joint}}$ is passed through a Gated Recurrent Unit (GRU), followed by two classification heads. One head predicts the utterance-level matching probability between audio and text, denoted as $q_{\text{utt}}$, while the other operates on temporal segments of $\mathbf{E}_{\text{joint}}$ to produce the phoneme-level alignment sequence $\mathbf{q}_{\text{phon}}$.

2.4 Multi-granularity Contrastive Learning

Phoneme-level Contrastive Learning (PCL). The audio encoder outputs frame-level CTC logits $\mathbf{z}\in\mathbb{R}^{T_{a}\times V}$ , supervised by the standard CTC loss [21]:

$$\mathcal{L}_{\text{CTC}}=-\log q_{\text{CTC}}(\mathbf{y}\mid\mathbf{z}), \tag{2}$$

where $q_{\text{CTC}}$ marginalizes over valid frame–phoneme alignments. With the aid of Viterbi decoding, we obtain alignment confidences $s_{i}$ for each audio-text pair. The corresponding PCL loss is

$$\mathcal{L}_{\text{PCL}}=\tfrac{1}{N}\sum_{i=1}^{N}\big[m_{i}(1-s_{i})^{2}+(1-m_{i})s_{i}^{2}\big], \tag{3}$$

where $m_{i}\in\{0,1\}$ indicates whether the $i$-th pair is matched. This encourages high alignment confidence for matched pairs and penalizes spurious overlaps for mismatched ones.
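Eq. (3) can be written directly as a short function. This is a minimal sketch that assumes the alignment confidences $s_{i}$ already lie in $[0,1]$; how they are derived from the Viterbi path is not detailed here, and the name `pcl_loss` is illustrative.

```python
def pcl_loss(confidences, matched):
    """Eq. (3): mean of m*(1-s)^2 + (1-m)*s^2 over N audio-text pairs.

    confidences: alignment confidences s_i, assumed to lie in [0, 1].
    matched: labels m_i, 1 for a matched pair and 0 otherwise.
    """
    if not confidences or len(confidences) != len(matched):
        raise ValueError("need equally sized, non-empty inputs")
    total = 0.0
    for s, m in zip(confidences, matched):
        total += m * (1.0 - s) ** 2 + (1 - m) * s ** 2
    return total / len(confidences)
```

For example, a matched pair with $s=0.9$ and a mismatched pair with $s=0.2$ give $(0.01+0.04)/2=0.025$.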

Utterance-level Contrastive Learning (UCL). For a mini-batch of $M$ pairs, we compute a similarity matrix $S_{\text{utt}}\in\mathbb{R}^{M\times M}$ . Text-to-audio ( $s^{\text{text}}_{v,r}$ ) and audio-to-text ( $s^{\text{audio}}_{v,r}$ ) scores are optimized bidirectionally:

$$\mathcal{L}_{\text{UCL}}=\tfrac{1}{2}(\ell_{\text{text}}+\ell_{\text{audio}}), \tag{4}$$

with each term defined by

$$\ell_{*}=-\tfrac{1}{M}\sum_{v=1}^{M}\sum_{r=1}^{M}\big[m_{v,r}\log\sigma(s^{*}_{v,r})+(1-m_{v,r})\log(1-\sigma(s^{*}_{v,r}))\big]. \tag{5}$$

Here $m_{v,r}=1$ if audio $v$ matches text $r$ , and $0$ otherwise. A mini-batch size of $M=5$ balances stability and discrimination.
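Eqs. (4) and (5) can be sketched as follows. The sketch assumes the two score matrices $s^{\text{text}}$ and $s^{\text{audio}}$ are supplied by the caller with the same $(v,r)$ indexing as the match matrix; how the two directions are derived from $S_{\text{utt}}$ is not specified in the text, and all function names are hypothetical.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def directional_loss(scores, match):
    """Eq. (5) for one direction: summed BCE over M x M entries, divided by M."""
    m_size = len(scores)
    total = 0.0
    for v in range(m_size):
        for r in range(m_size):
            s, m = scores[v][r], match[v][r]
            # log(1 - sigmoid(s)) equals log(sigmoid(-s)).
            total += m * log_sigmoid(s) + (1 - m) * log_sigmoid(-s)
    return -total / m_size

def ucl_loss(scores_text, scores_audio, match):
    """Eq. (4): average of the text-to-audio and audio-to-text terms."""
    return 0.5 * (directional_loss(scores_text, match)
                  + directional_loss(scores_audio, match))
```

With all scores at zero, every entry contributes $\log 2$, so each directional term equals $M\log 2$; scores that are large on matched entries and strongly negative elsewhere drive the loss toward zero.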

2.5 False Alarm-aware Loss

False alarms (FA), i.e., false keyword detections on non-target audio, remain a major challenge facing KWS. Conventional BCE training maximizes overall accuracy but does not explicitly penalize FA, often requiring post-hoc threshold tuning. We therefore introduce a precision-constrained objective:

$$\mathcal{L}_{\text{FA}}=-\log(\text{Precision})+\lambda\cdot\max(0,\alpha-\text{Precision}), \tag{6}$$

where the first term discourages low-precision predictions and the second enforces a margin constraint if precision falls below $\alpha$ (scaled by $\lambda$ ) [22]. For differentiability, true positives (TP) and false positives (FP) are approximated by smooth sigmoid bounds:

$$\text{TP}=\sum(1+\gamma\delta)\,\sigma(\gamma x-\delta)\,x_{\text{true}}, \tag{7}$$

$$\text{FP}=\sum(1+\gamma\delta)\,\sigma(\gamma x+\delta)\,(1-x_{\text{true}}), \tag{8}$$

with $x_{\text{true}}\in\{0,1\}$ , sigmoid $\sigma(\cdot)$ , steepness $\gamma=7.0$ , and offset $\delta=0.035$ . We set $\alpha=0.9$ and $\lambda=10.0$ . This auxiliary loss is combined with BCE to improve FA suppression during training.
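The following sketch combines Eqs. (6) to (8) into one function. It assumes that $x$ is a real-valued decision score with its boundary at zero and that Precision is computed as TP/(TP+FP) from the smooth counts; the small `eps` term is added here only for numerical safety and is not part of the equations.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fa_aware_loss(scores, labels, gamma=7.0, delta=0.035,
                  alpha=0.9, lam=10.0, eps=1e-8):
    """Precision-constrained loss of Eq. (6) with the smooth TP/FP of Eqs. (7)-(8).

    scores: decision scores x (assumed: real-valued, boundary at 0).
    labels: x_true in {0, 1}.
    eps: illustrative stabilizer, not part of the original equations.
    """
    scale = 1.0 + gamma * delta
    tp = sum(scale * sigmoid(gamma * x - delta) * y
             for x, y in zip(scores, labels))
    fp = sum(scale * sigmoid(gamma * x + delta) * (1 - y)
             for x, y in zip(scores, labels))
    precision = tp / (tp + fp + eps)
    # First term rewards high precision; second adds a hinge below alpha.
    return -math.log(precision + eps) + lam * max(0.0, alpha - precision)
```

When half of the confident detections are false positives, precision is about 0.5, so the hinge term contributes roughly $10\times(0.9-0.5)=4$ on top of $-\log 0.5\approx 0.69$.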

2.6 Training Criterion

Both utterance- and phoneme-level predictions are supervised using BCE losses, denoted by $\mathcal{L}_{\text{utt}}$ and $\mathcal{L}_{\text{phon}}$ , respectively. The overall training objective combines all loss terms:

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{utt}}+\mathcal{L}_{\text{phon}}+\mathcal{L}_{\text{CTC}}+\mathcal{L}_{\text{PCL}}+\mathcal{L}_{\text{UCL}}+\mathcal{L}_{\text{FA}}, \tag{9}$$

where $\mathcal{L}_{\text{CTC}}$ (Eq. 2), $\mathcal{L}_{\text{PCL}}$ (Eq. 3), $\mathcal{L}_{\text{UCL}}$ (Eq. 4), and $\mathcal{L}_{\text{FA}}$ (Eq. 6) correspond to the CTC, phoneme-level contrastive, utterance-level contrastive, and FA-aware losses, respectively. All terms are assigned an equal weight of 1; exploring alternative weighting strategies is beyond the scope of this work.

3 Experimental Setup

3.1 Datasets

We use the LibriPhrase train-clean-100 and train-clean-360 sets with MUSAN noise [23] for training. Evaluation is conducted on four benchmarks: LibriPhrase Easy/Hard (LP_E/LP_H) from train-other-500 (low/high phonetic confusion), Google Speech Commands V2 (G) [24] (35 commands under diverse conditions), Qualcomm Keyword Speech (Q) [25] (accented/domain-specific keywords), and AMI [26] (12 h of meeting recordings segmented into 2 s clips for FA evaluation).

3.2 Implementation Details

We employ Google speech embeddings [19] as the pre-trained audio encoder. All models are trained for 50 epochs using Adam with a fixed learning rate of $10^{-3}$ , batch size $N=1000$ , and mini-batch size $M=5$ for UCL (Section 2.4). Experiments are conducted on an NVIDIA RTX 4090 GPU using TensorFlow.

4 Experimental Results

4.1 Main Results

Table 1 compares MALEFA with prior ZSKWS models and presents an ablation study. While CED [12] achieves strong accuracy, its Conformer-based encoder [13] incurs much higher complexity, limiting on-device usage. PhonMatchNet [8] suffers a significant drop on LP_H (AUC $=88.52$, EER $=18.82$), whereas MALEFA remains more robust (AUC $=93.58$, EER $=13.91$). Ablation results show that removing the FA-aware loss (w/o FA) sharply increases EER, excluding UCL degrades robustness to phonetic ambiguities, and discarding PCL reduces fine-grained alignment on LP_H. In contrast, the full MALEFA, integrating all three components, achieves state-of-the-art performance across benchmarks with only 0.7M parameters. This confirms that the FA-aware loss, UCL, and PCL are jointly essential for reliable ZSKWS.
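For readers unfamiliar with the EER metric reported in Table 1, the sketch below approximates it by sweeping thresholds over the observed scores and picking the point where the false alarm and false rejection rates are closest. This is a generic illustration, not the evaluation script used in the paper; the function name is hypothetical.

```python
def equal_error_rate(scores, labels):
    """Approximate EER (%) by sweeping thresholds over the observed scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("EER needs both target and non-target trials")
    best_gap, eer = None, None
    for t in sorted(set(scores)):
        far = sum(1 for s in neg if s >= t) / len(neg)  # false alarm rate
        frr = sum(1 for s in pos if s < t) / len(pos)   # false rejection rate
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint of the two rates at the closest crossing.
            best_gap, eer = gap, 100.0 * (far + frr) / 2.0
    return eer
```

A production evaluation would typically interpolate between thresholds; the discrete sweep here is kept deliberately simple.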

4.2 False Alarm Results

Table 2 reports the FAR across test sets. Compared with PhonMatchNet [8], which suffers from high FARs (17.9% on AMI), MALEFA reduces FAR to below 0.01% on all benchmarks. Ablation further confirms the contribution of each component: removing the FA-aware loss (w/o FA) causes the largest degradation (14.5% on AMI), excluding UCL increases FAR to 1.3% on AMI, and discarding PCL still yields a higher FAR (0.085% on AMI). Overall, the complete MALEFA achieves the lowest FAR across all datasets, highlighting that PCL, UCL, and FA-aware learning are complementary and jointly indispensable.
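To make the FAR metric concrete, a minimal sketch is given below. It assumes FAR is the percentage of non-target trials whose score reaches a decision threshold; the threshold of 0.5 and the function name are illustrative, since the operating point used for Table 2 is not stated.

```python
def false_alarm_rate(scores, labels, threshold=0.5):
    """FAR in percent: share of non-target trials whose score reaches the threshold.

    threshold: illustrative default; the paper's operating point is not given.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("FAR is undefined without non-target trials")
    false_alarms = sum(1 for s in negatives if s >= threshold)
    return 100.0 * false_alarms / len(negatives)
```

Under this definition, a FAR of 0.007% corresponds to about 7 false alarms per 100,000 non-target trials.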

4.3 Effects of Multi-granularity Contrastive Learning on Audio-Text Matching

To illustrate the impact of contrastive objectives, we visualize cosine similarity matrices for five representative keywords from the G dataset in Fig. 3. The baseline model (Original) exhibits high similarity not only on correct matches but also across confusable pairs (e.g., “bed” vs. “three”), indicating risk of FA. Introducing UCL enforces stronger inter-class separation, effectively suppressing non-matching similarities and yielding cleaner diagonal patterns. Further adding PCL sharpens the alignment, driving non-matching scores close to zero while preserving near-perfect self-matches. These results qualitatively confirm that UCL improves global discrimination and PCL complements it with fine-grained alignment, together enhancing robustness against phonetically similar triggers.
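A cosine similarity matrix of the kind visualized here can be computed as in the sketch below. It assumes each keyword's audio and text representation has already been pooled into a single fixed-length vector; the pooling used for the figure is not described, and the function name is hypothetical.

```python
import math

def cosine_similarity_matrix(audio_vecs, text_vecs):
    """Return a matrix whose entry [i][j] is cos(audio_vecs[i], text_vecs[j]).

    Assumption: inputs are fixed-length vectors (e.g. pooled embeddings).
    Zero vectors yield a similarity of 0.0 instead of a division error.
    """
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    matrix = []
    for a in audio_vecs:
        row = []
        for t in text_vecs:
            denom = norm(a) * norm(t)
            dot = sum(x * y for x, y in zip(a, t))
            row.append(dot / denom if denom > 0 else 0.0)
        matrix.append(row)
    return matrix
```

A well-separated model would produce values near 1 on the diagonal (matching keyword pairs) and near 0 elsewhere.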

4.4 Effects of Phoneme-level Contrastive Learning on Frame-wise Alignment

Figure 4 compares phoneme-to-frame attention maps for the keyword “hey android” with and without PCL. Without PCL, the attention is diffuse, yielding imprecise phonetic boundaries that may cause false alarms. With PCL, alignments become sharper and more localized, indicating that the model learns more discriminative frame-level representations and improves robustness against acoustically similar distractors.

5 Conclusion and Future Work

In this work, we have presented MALEFA, a lightweight ZSKWS framework that avoids reliance on large pre-trained models. By integrating multi-granularity contrastive learning with a novel false alarm-aware loss, MALEFA effectively captures global semantics and fine-grained pronunciations, and directly suppresses false triggers. Experiments show that MALEFA delivers state-of-the-art performance with 99% AUC, 1% EER, and an ultra-low FAR of 0.007%, making it highly suitable for resource-constrained deployments. In future work, we plan to explore cross-lingual extensions to improve robustness across diverse languages.

6 Acknowledgments

This work was supported in part by Realtek Semiconductor Corporation under Grant Numbers 113KK01103 and 114KK01005. Any findings and implications in the paper do not necessarily reflect those of the sponsors.

References

[1] Tara N. Sainath and Carolina Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Interspeech, 2015, pp. 1478–1482.
[2] Guoguo Chen, Carolina Parada, and Georg Heigold, “Small-footprint keyword spotting using deep neural networks,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4087–4091.
[3] Iván López-Espejo, Zheng-Hua Tan, John H. L. Hansen, and Jesper Jensen, “Deep spoken keyword spotting: An overview,” IEEE Access, vol. 10, pp. 4169–4199, 2021.
[4] Bo Wei, Meirong Yang, Tao Zhang, Xiao Tang, Xing Huang, Kyuhong Kim, Jaeyun Lee, Kiho Cho, and Sung-Un Park, “End-to-end transformer-based open-vocabulary keyword spotting with location-guided local attention,” in Proc. Interspeech, 2021, pp. 361–365.
[5] Themos Stafylakis and Georgios Tzimiropoulos, “Zero-shot keyword spotting for visual speech recognition in-the-wild,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 513–529.
[6] Bolaji Yusuf, Alican Gok, Batuhan Gundogdu, and Murat Saraclar, “End-to-end open vocabulary keyword search,” in Proc. Interspeech, 2021, pp. 4388–4392.
[7] Hyeon-Kyeong Shin, Hyewon Han, Doyeon Kim, Soo-Whan Chung, and Hong-Goo Kang, “Learning audio-text agreement for open-vocabulary keyword spotting,” arXiv preprint arXiv:2206.15400, 2022.
[8] Yong-Hyeok Lee and Namhyun Cho, “Phonmatchnet: Phoneme-guided zero-shot keyword spotting for user-defined keywords,” in Proc. Interspeech, 2023.
[9] Ao Zhang, Pan Zhou, Kaixun Huang, Yong Zou, Ming Liu, and Lei Xie, “U2-KWS: Unified two-pass open-vocabulary keyword spotting with keyword bias,” in Proc. IEEE ASRU Workshop, 2023.
[10] Aviv Navon, Aviv Shamsian, Neta Glazer, Gill Hetz, and Joseph Keshet, “Open-vocabulary keyword-spotting with adaptive instance normalization,” arXiv preprint arXiv:2309.08561, 2023.
[11] Zhiqi Ai, Zhiyong Chen, and Shugong Xu, “Mm-kws: Multi-modal prompts for multilingual user-defined keyword spotting,” arXiv preprint arXiv:2406.07310, 2024.
[12] Kumari Nishu, Minsik Cho, Paul Dixon, and Devang Naik, “Flexible keyword spotting based on homogeneous audio-text embedding,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 5050–5054.
[13] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.
[14] Yusong Wu, Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang, “Large-scale contrastive language-audio pretraining (clap),” arXiv preprint arXiv:2211.06687, 2022.
[15] Yu Xi, Baochen Yang, Hao Li, Jiaqi Guo, and Kai Yu, “Contrastive learning with audio discrimination for customizable keyword spotting in continuous speech,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.
[16] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, 2021, pp. 8748–8763.
[17] Li Kewei, Zhou Hengshun, Shen Kai, Dai Yusheng, and Du Jun, “Phoneme-level contrastive learning for user-defined keyword spotting with flexible enrollment,” arXiv preprint arXiv:2412.20805, 2024.
[18] Youngmoon Jung, Yong-Hyeok Lee, Myunghun Jung, Jaeyoung Roh, Chang Woo Han, and Hoon-Young Cho, “Adversarial deep metric learning for cross-modal audio-text alignment in open-vocabulary keyword spotting,” arXiv preprint arXiv:2505.16735, 2025.
[19] James Lin, Kevin Kilgour, Dominik Roblek, and Matthew Sharifi, “Training keyword spotters with limited and synthesized speech data,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7474–7478.
[20] Kyubyong Park and Jongseok Kim, “g2pE,” https://github.com/Kyubyong/g2p, 2019.
[21] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006, pp. 369–376.
[22] Preetish Rath and Michael Hughes, “Optimizing early warning classifiers to control false alarms via a minimum precision constraint,” in Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, Eds., 28–30 Mar 2022, vol. 151 of Proceedings of Machine Learning Research, pp. 4895–4914.
[23] David Snyder, Guoguo Chen, and Daniel Povey, “Musan: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
[24] Pete Warden, “Speech commands: A dataset for limited-vocabulary keyword spotting,” in Proceedings of Interspeech, 2018.
[25] Byeonggeun Kim, Mingu Lee, Jinkyu Lee, Yeonseok Kim, and Kyuwoong Hwang, “Query-by-example on-device keyword spotting,” in 2019 IEEE automatic speech recognition and understanding workshop (ASRU), 2019, pp. 532–538.
[26] Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Maël Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al., “The ami meeting corpus: A pre-announcement,” in Proc. International Workshop on Machine Learning for Multimodal Interaction (MLMI). Springer, 2005, pp. 28–39.