Contrasting deep learning audio models for direct respiratory insufficiency detection versus blood oxygen saturation estimation Marcelo Matheus Gauy, Natália Hitomi Koza, Ricardo Mikio Morita, Gabriel Rocha Stanzione, Arnaldo Cândido Júnior, Larissa Cristina Berti, Anna Sara Shafferman Levin, Ester Cerdeira Sabino, Flaviane Romani Fernandes Svartman, Marcelo Finger Intelligence Based Medicine, 2026 This work investigates the strengths and limitations of non-invasive audio-based deep learning methods for the detection of respiratory conditions. We contrast the performance obtained on two tasks: expert-centered respiratory insufficiency (RI) detection and estimation of the easily measured blood oxygen saturation (SpO2). Several deep learning audio models have recently been proposed for RI detection via voice and speech analysis; these models have obtained an accuracy of 95% in general patients and 97.4% in COVID-19 patients. Here, we extend those results, refining several pretrained audio neural networks (CNN6, CNN10 and CNN14) and Masked Autoencoders (Audio-MAE) for RI detection, and show that some of these models achieve near-perfect accuracy (99.9% on COVID RI and 98.6% on general RI). The models were pretrained on AudioSet, which improved performance, with transfer learning playing a key role in preventing overfitting. The near-perfect RI detection performance suggests that low-cost, automated methods could be developed to assist patient triage. In parallel, this paper seeks to verify the feasibility of SpO2 estimation, so we perform binary classification at a 92% SpO2 threshold using the same architectures. In contrast to our findings for RI, this model yielded an accuracy below 70% and a Matthews correlation coefficient (MCC) below 0.3, indicating both that SpO2 estimation solely from audio is unfeasible and that the audio contains multiple features that are useful for RI detection but not for SpO2 estimation. We propose that this discrepancy demonstrates the limits of voice and speech biomarkers across different diagnostic tasks under current technologies.
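The two figures reported for the SpO2 task, accuracy and MCC, can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names, the choice to treat readings strictly below 92% as the positive class, and the toy labels are all assumptions made for illustration.

```python
import math

def binarize_spo2(values, threshold=92.0):
    """Turn SpO2 readings (in percent) into binary labels.

    Assumption: readings strictly below the threshold are the positive class (1).
    The abstract only states that a 92% threshold is used.
    """
    return [1 if v < threshold else 0 for v in values]

def accuracy_and_mcc(y_true, y_pred):
    """Return (accuracy, Matthews correlation coefficient) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is taken as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc

# Toy example with made-up readings and predictions.
labels = binarize_spo2([95.0, 91.5, 88.0, 97.0])  # two positives, two negatives
predictions = [0, 1, 0, 0]                        # one positive is missed
acc, mcc = accuracy_and_mcc(labels, predictions)
print(f"accuracy={acc:.2f}, MCC={mcc:.2f}")
```

Unlike accuracy, MCC takes all four cells of the confusion matrix into account, so it stays close to zero when a classifier does little better than predicting the majority class.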
Dual-Bandwidth Spectrogram Analysis for Speaker Verification Rafaello Virgilli, Arnaldo Candido Junior, Augusto Seben da Rosa, Frederico S. Oliveira, Anderson da Silva Soares Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2025
Interpretability analysis of deep models for COVID-19 detection Daniel Peixoto Pinto da Silva, Edresson Casanova, Lucas Rafael Stefanel Gris, Marcelo Matheus Gauy, Arnaldo Candido Junior, Marcelo Finger, Flaviane Romani Fernandes Svartman, Beatriz Raposo de Medeiros, Marcus Vinícius Moreira Martins, Sandra Maria Aluísio, Larissa Cristina Berti, João Paulo Teixeira Artificial Intelligence in Health, 2024 During the coronavirus disease 2019 (COVID-19) pandemic, various research disciplines collaborated to address the impacts of severe acute respiratory syndrome coronavirus-2 infections. This paper presents an interpretability analysis of a convolutional neural network-based model designed for COVID-19 detection using audio data. We explore the input features that play a crucial role in the model’s decision-making process, including spectrograms, fundamental frequency (F0), F0 standard deviation, sex, and age. Subsequently, we examine the model’s decision patterns by generating heat maps to visualize its focus during the decision-making process. Emphasizing an explainable artificial intelligence approach, our findings demonstrate that the examined models can make unbiased decisions even in the presence of noise in the training-set audio, provided appropriate preprocessing steps are undertaken. Our top-performing model achieves a detection accuracy of 94.44%. Our analysis indicates that the models prioritize high-energy areas in spectrograms during the decision process, particularly those associated with prosodic domains, while also effectively utilizing F0 for COVID-19 detection.
CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese Arnaldo Candido Junior, Edresson Casanova, Anderson Soares, Frederico Santos de Oliveira, Lucas Oliveira, Ricardo Corso Fernandes Junior, Daniel Peixoto Pinto da Silva, Fernando Gorgulho Fayet, Bruno Baldissera Carlotto, Lucas Rafael Stefanel Gris, Sandra Maria Aluísio Language Resources and Evaluation, 2023 Automatic speech recognition (ASR) is a complex and challenging task. In recent years, there have been significant advances in the area. In particular, for the Brazilian Portuguese (BP) language, there were around 376 h publicly available for the ASR task until the second half of 2020. With the release of new datasets in early 2021, this number increased to 574 h. The existing resources, however, consist of audio containing only read and prepared speech. There is a lack of datasets including spontaneous speech, which is essential in several ASR applications. This paper presents CORAA (Corpus of Annotated Audios) ASR with 290 h, a publicly available dataset for ASR in BP containing validated audio-transcription pairs. CORAA ASR also contains European Portuguese audio (4.6 h). We also present a public ASR model based on Wav2Vec 2.0 XLSR-53, fine-tuned on CORAA ASR. Our model achieved a Word Error Rate (WER) of 24.18% on the CORAA ASR test set and 20.08% on the Common Voice test set. When measuring the Character Error Rate (CER), we obtained 11.02% and 6.34% for CORAA ASR and Common Voice, respectively. The CORAA ASR corpora were assembled both to improve ASR models in BP with phenomena from spontaneous speech and to motivate young researchers to start their studies on ASR for Portuguese. All the corpora are publicly available at https://github.com/nilc-nlp/CORAA under the CC BY-NC-ND 4.0 license.
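The WER and CER figures above are both edit-distance ratios, which the following minimal Python sketch illustrates. It is not the scoring script used for CORAA ASR: the function names are hypothetical, the example sentences are made up, and any text normalization applied in the paper (casing, punctuation, handling of spaces in CER) is unknown, so none is applied here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over the reference length.

    Assumption: spaces count as characters.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

# Made-up example: one wrong word out of four.
print(wer("o gato preto dorme", "o gato preto come"))
print(cer("casa", "caso"))
```

Because a single wrong character makes a whole word count as an error, WER is usually higher than CER on the same output, which matches the pattern in the reported figures.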
Brazilian Portuguese Speech Recognition Using Wav2vec 2.0 Lucas Rafael Stefanel Gris, Edresson Casanova, Frederico Santos de Oliveira, Anderson da Silva Soares, Arnaldo Candido Junior Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2022
Speech2Phone: A Novel and Efficient Method for Training Speaker Recognition Models Edresson Casanova, Arnaldo Candido Junior, Christopher Shulby, Frederico Santos de Oliveira, Lucas Rafael Stefanel Gris, Hamilton Pereira da Silva, Sandra Maria Aluísio, Moacir Antonelli Ponti Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2021