The Kolmogorov-Arnold Networks: A New Foundation Paradigm for Hierarchical Function Learning in ASV Anti-Spoofing Research Vamsi Krishna Badugu, Suresh Veesa, Madhusudan Singh IEEE International Conference on Communication Networks and Satellite Comnetsat, 2025 The widespread adoption of automatic speaker verification (ASV) systems in security-critical applications calls for more generalized countermeasures against spoofing attacks such as replay, voice conversion, speech synthesis, and deepfake-based manipulation. Although traditional deep learning based countermeasures are at an advanced stage, they still face major challenges, including limited generalization to unseen attack types, high model complexity, and poor interpretability. Recently, Kolmogorov-Arnold Networks (KANs), rooted in the Kolmogorov-Arnold Representation Theorem (KART), have emerged as a promising new direction for audio anti-spoofing research. This review traces the historical progression of spoofing attacks and countermeasure developments, provides an analytical overview of deep learning architectures used for spoof detection, and explores the growing role of KAN-based approaches and their variants. Furthermore, it highlights both the theoretical and practical strengths of KANs, considers their potential integration with conventional deep architectures as well as standalone KAN designs, and critically examines their current challenges. The paper concludes by outlining prospective research directions for developing more robust, interpretable, and generalizable anti-spoofing systems.
Exploring Source Features with Deep Residual Neural Networks for Replay Attack Detection Suresh Veesa, Badugu Vamsi Krishna, Madhusudan Singh 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference Apsipa ASC 2025, 2025 Replay spoofing poses a serious challenge to the reliability of automatic speaker verification (ASV) systems, particularly in real-world applications. While most countermeasures have concentrated on spectral features, features based on excitation source information remain underexplored. This study addresses this gap by leveraging Linear Prediction Residual (LPR) features, which capture critical excitation source characteristics relevant for replay detection. Specifically, we investigate the effectiveness of Residual Constant Q Cepstral Coefficient (RCQCC), Residual Mel-Frequency Cepstral Coefficient (RMFCC), and Residual Phase Constant Q Cepstral Coefficient (RPCQCC) features, in conjunction with log-spectrogram representations. A deep residual neural network (DRNN) classifier is developed to fully exploit these LPR-based features. Evaluation on the ASVspoof 2017 v2.0 (17PA) and ASVspoof 2019 PA (19PA) datasets demonstrates that fusing spectral and source-based features significantly improves detection performance. The best fusion model reports an equal error rate (EER) of 8.40% on 17PA and a tandem detection cost function (t-DCF) value of 0.1447 on 19PA. These findings highlight the value of integrating excitation source information with spectral features and advanced deep learning models to strengthen replay spoofing countermeasures in ASV systems.
Linear Prediction Networks for Residual based Replay Speech Detection Suresh Veesa, Badugu Vamsi Krishna, Madhusudan Singh Indiscon 2025 IEEE 6th India Council International Subsections Conference Proceedings, 2025 The linear prediction residual (LPR) component of the speech signal plays a crucial role in detecting replay attacks. Replay is the simplest method fraudsters use to deceive automatic speaker verification systems into accepting a fake speaker. The LPR signal conveys information about the excitation source, making it valuable for differentiating between spoofed and authentic speech samples. This work introduces a deep learning classifier called Linear Prediction Network (LPNet), which uses deep residual layers and takes LPR features as input. We explore various source features to evaluate the usefulness of LPR information in replay detection. The source features employed in this study are residual constant Q cepstral coefficients (RCQCC), residual mel-frequency cepstral coefficients (RMFCC), and residual phase constant Q cepstral coefficients (RPCQCC). These features serve as the front end for the proposed models. Replay speech detection experiments were conducted using the standard ASVspoof 2017 Version 2.0 database. With the LPNet classifier, the source features yielded equal error rates (EERs) of 14.71%, 21.03%, and 14.84%, respectively. The score-level combination of all three source features with the popular CQCC feature achieved an 8.62% EER. The proposed combination outperforms state-of-the-art replay detection techniques, motivating further exploration in this direction.
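The entry above reports results as equal error rates after score-level combination of several systems. As a rough illustration of those two steps only, the following is a minimal sketch: the function names `fuse_scores` and `equal_error_rate`, the weighted-average fusion rule, and the toy scores are all assumptions made here for illustration, since the abstract does not say how the scores were combined or how the EER was computed.

```python
def fuse_scores(score_lists, weights=None):
    """Weighted average of per-system scores, trial by trial.

    score_lists: one list of scores per system, all of the same length.
    weights: hypothetical fusion weights; equal weights if omitted.
    """
    n_systems = len(score_lists)
    if weights is None:
        weights = [1.0 / n_systems] * n_systems
    total = sum(weights)
    return [
        sum(w * s for w, s in zip(weights, trial)) / total
        for trial in zip(*score_lists)
    ]


def equal_error_rate(genuine, spoof):
    """Sweep thresholds over the observed scores and return the error rate
    at the point where false acceptance and false rejection are closest.
    Higher scores are assumed to mean 'more likely genuine'."""
    best_gap, best_eer = None, None
    for t in sorted(set(genuine) | set(spoof)):
        frr = sum(g < t for g in genuine) / len(genuine)  # genuine rejected
        far = sum(s >= t for s in spoof) / len(spoof)     # spoof accepted
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


if __name__ == "__main__":
    # Toy scores from two hypothetical systems on the same four trials.
    system_a = [0.9, 0.7, 0.2, 0.6]
    system_b = [0.8, 0.9, 0.1, 0.3]
    fused = fuse_scores([system_a, system_b])
    # First two trials are treated as genuine, last two as spoofed.
    print(equal_error_rate(fused[:2], fused[2:]))
```

In practice the per-system scores would usually be normalized to a common range before averaging, and the weights would be tuned on a development set; neither step is shown here.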
Fusion of RMFCC and RCQCC Features for Replay Attack Detection Task Suresh Veesa, Badugu Vamsi Krishna, Madhusudan Singh Proceedings of 2023 IEEE International Conference on Internet of Things and Intelligence Systems Iotais 2023, 2023 In this work, the linear prediction (LP) residual, also known as the excitation component of speech, is processed for detecting replay attacks. The LP residual is derived from speech using LP analysis with a suitable LP order. It represents excitation source information in implicit form. Features derived from the LP residual using signal processing algorithms are referred to as excitation source features. Two source features, namely residual mel-frequency cepstral coefficients (RMFCC) and residual constant-Q cepstral coefficients (RCQCC), are derived and used for the replay attack detection task. A Gaussian mixture model (GMM) is used as the back-end classifier. The experimental study is conducted using the ASVspoof 2017 Version 2.0 database. The RMFCC-GMM and RCQCC-GMM systems provide EERs of 20.89% and 18.51%, respectively. The score-level fusion of both systems results in a notable 11.72% EER, indicating that the two features carry significant complementary information useful for replay speech detection. This suggests that combining source features obtained from the LP residual with suitable signal processing methods may offer a better alternative to existing solutions for replay attack detection. Further, score-level fusion of RMFCC, RCQCC, and CQCC features provides a 9.18% EER, the best performance reported in this work.
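The entry above derives the LP residual by running LP analysis on speech and keeping the prediction error. A minimal sketch of that step for a single frame is given below, assuming the autocorrelation method with the Levinson-Durbin recursion; the function names, the LP order, and the synthetic test signal are hypothetical choices made here, and the paper's actual framing, windowing, and LP order are not stated in the abstract.

```python
import random


def autocorrelation(frame, max_lag):
    """Unnormalized autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lp_coefficients(frame, order):
    """Levinson-Durbin recursion (autocorrelation method).

    Returns a[1..order] for the predictor s_hat[n] = sum_k a[k] * s[n - k].
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or perfectly predictable frame
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]


def lp_residual(frame, order):
    """Prediction error e[n] = s[n] - s_hat[n]; samples before the frame
    start are treated as zero."""
    coeffs = lp_coefficients(frame, order)
    residual = []
    for n, sample in enumerate(frame):
        predicted = sum(
            coeffs[k - 1] * frame[n - k]
            for k in range(1, order + 1)
            if n - k >= 0
        )
        residual.append(sample - predicted)
    return residual


if __name__ == "__main__":
    # Synthetic second-order autoregressive signal standing in for speech.
    rng = random.Random(0)
    signal = [0.0, 0.0]
    for _ in range(2000):
        signal.append(1.3 * signal[-1] - 0.6 * signal[-2] + rng.gauss(0.0, 1.0))
    # The estimated coefficients should land near the generating values 1.3 and -0.6.
    print(lp_coefficients(signal, 2))
    print(len(lp_residual(signal, 2)))
```

Features such as RMFCC or RCQCC would then be computed from this residual rather than from the speech waveform itself; that feature extraction stage is not shown here.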