A Sample-Based, Multistage Machine Learning Pipeline for Scalable IoT Threat Detection Marcelo V. C. Aragão, Tiago de M. Pereira, Felipe A. P. de Figueiredo, Samuel B. Mafra IEEE Embedded Systems Letters, 2026 The rapid growth of IoT devices demands scalable and efficient threat detection solutions. This paper introduces a sample-based, multi-stage machine learning (ML) pipeline for IoT threat detection using the CICIoT2023 dataset, integrating feature selection, data balancing, and hyperparameter optimization to improve detection accuracy while reducing the computational overhead associated with training. We evaluate several models across binary, multiclass, and fine-grained tasks, showing that training with 10% sampling achieves the best trade-off between accuracy and efficiency. Compared to prior methods, our approach eliminates GPU dependence, maintains low latency, and preserves state-of-the-art performance while enabling scalable training for high generalization capacity. Additionally, we provide model selection guidelines based on dataset complexity and computational constraints. The results show that training with a sample-based approach enables effective threat detection on large datasets, producing models that generalize well to diverse IoT attack scenarios, thus ensuring practical applicability in real-world deployments.
Large-Scale Benchmarking of Intrusion Detection Datasets With GPU-Accelerated Data Pipelines, Complexity Analysis, and Model Evaluation Marcelo V. C. Aragão, Felipe A. P. de Figueiredo, Samuel B. Mafra International Journal of Intelligent Systems, 2026 Intrusion detection systems (IDSs) are critical for identifying malicious activity in computer networks; however, the evaluation of machine learning (ML)-based IDS remains inconsistent and fragmented. Many existing studies rely on outdated datasets, neglect computational complexity, or use limited performance metrics. Additionally, few works leverage the full potential of modern graphics processing unit (GPU) acceleration. The objective of this study is to establish a scalable, reproducible, and standardized benchmarking framework for intrusion detection. We present an end‐to‐end, GPU‐accelerated pipeline that integrates automated data preprocessing, intrinsic dataset complexity analysis, and multiobjective hyperparameter optimization (HPO) across more than 70 publicly available datasets. Our numerical findings demonstrate that stratified sampling rates of 10% are sufficient to maintain statistical signal integrity, with class probability deviations remaining below 0.01 relative to the full population. Furthermore, feature‐reduced configurations decrease the model size by a median of 60% while maintaining weighted F1 scores within 0.01 of the baseline. Finally, experimental complexity analysis reveals that the GPU‐accelerated modeling stages achieve empirical time‐invariance (O(1)), reducing training latency by up to two orders of magnitude compared with traditional central processing unit (CPU) workflows. These contributions offer a rigorous quantitative view of the performance‐efficiency trade‐offs essential for next‐generation IDS evaluation.
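As an illustration of the sampling claim in the abstract above, the following is a minimal sketch of per-class (stratified) sampling at a 10% rate, followed by a check of how far the sampled class probabilities drift from the full population. It is not the authors' pipeline: the function names, the fixed seed, and the synthetic label counts are all assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(labels, rate, seed=0):
    """Return sorted indices of a per-class random sample at the given rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for indices in by_class.values():
        # Keep at least one record per class so rare classes survive sampling.
        k = max(1, round(len(indices) * rate))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


def class_probabilities(labels):
    """Map each class to its relative frequency."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}


def max_probability_deviation(labels, sample_indices):
    """Largest absolute gap between full-population and sample class probabilities."""
    full = class_probabilities(labels)
    sample = class_probabilities([labels[i] for i in sample_indices])
    return max(abs(full[c] - sample.get(c, 0.0)) for c in full)


# Hypothetical, synthetic label distribution (not taken from any real dataset).
labels = ["benign"] * 7000 + ["ddos"] * 2500 + ["scan"] * 450 + ["botnet"] * 50
sample_idx = stratified_sample(labels, rate=0.10)
deviation = max_probability_deviation(labels, sample_idx)
```

Because each class is sampled separately, the deviation depends only on per-class rounding, which is why stratification keeps class proportions close to those of the full data even at low sampling rates.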
A practical evaluation of AutoML tools for binary, multiclass, and multilabel classification Marcelo V. C. Aragão, Augusto G. Afonso, Rafaela C. Ferraz, Rairon G. Ferreira, Sávio G. Leite, Felipe A. P. de Figueiredo, Samuel B. Mafra Scientific Reports, 2025 Selecting the most suitable Automated Machine Learning (AutoML) tool is pivotal for achieving optimal performance in diverse classification tasks, including binary, multiclass, and multilabel scenarios. The wide range of frameworks with distinct features and capabilities complicates this decision, necessitating a systematic evaluation. This study benchmarks sixteen AutoML tools, including AutoGluon, AutoSklearn, TPOT, PyCaret, and Lightwood, across all three classification types using 21 real-world datasets. Unlike prior studies focusing on a subset of classification tasks or a limited number of tools, we provide a unified evaluation of sixteen frameworks, incorporating feature-based comparisons, time-constrained experiments, and multi-tier statistical validation. We also compared our findings with four representative prior benchmarks to contextualize our results within the existing literature. A key contribution of our study is the in-depth assessment of multilabel classification, exploring both native and label-powerset representations and revealing that several tools lack robust multilabel capabilities. Our findings demonstrate that AutoSklearn excels in predictive performance for binary and multiclass settings, albeit at longer training times, while Lightwood and AutoKeras offer faster training at the cost of predictive performance on complex datasets. AutoGluon emerges as the best overall solution, balancing predictive accuracy with computational efficiency. Our statistical analysis—at per-dataset, across-datasets, and all-datasets levels—confirms significant performance differences among tools, highlighting accuracy-speed trade-offs in AutoML. 
These insights underscore the importance of aligning tool selection with specific problem characteristics and resource constraints. The open-source code and reproducible experimental protocols further ensure the study’s value as a robust resource for researchers and practitioners.
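The label-powerset representation mentioned in the abstract above can be sketched as follows: each distinct combination of labels is mapped to a single class id, turning a multilabel problem into a multiclass one. This is a generic illustration, not code from the study; the function names and the toy label sets are assumptions.

```python
def to_label_powerset(label_sets):
    """Encode each distinct label combination as one multiclass id."""
    combo_to_class = {}
    encoded = []
    for labels in label_sets:
        combo = frozenset(labels)
        if combo not in combo_to_class:
            # Ids are assigned in order of first appearance.
            combo_to_class[combo] = len(combo_to_class)
        encoded.append(combo_to_class[combo])
    return encoded, combo_to_class


def from_label_powerset(class_ids, combo_to_class):
    """Decode multiclass ids back into their label sets."""
    class_to_combo = {cid: combo for combo, cid in combo_to_class.items()}
    return [set(class_to_combo[cid]) for cid in class_ids]


# Hypothetical toy data: the label names are illustrative only.
samples = [{"spam", "phishing"}, {"spam"}, set(), {"phishing", "spam"}]
encoded, mapping = to_label_powerset(samples)
decoded = from_label_powerset(encoded, mapping)
```

The trade-off of this representation is that the number of classes grows with the number of distinct label combinations, and combinations absent from the training data cannot be predicted.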
Dynamic-Balancing AutoML for Imbalanced Tabular Data With Adaptive Resampling and Complexity-Aware Analysis Marcelo V. C. Aragão, Tiago de M. Pereira, Mateus de F. Carvalho, Felipe A. P. de Figueiredo, Samuel B. Mafra International Journal of Intelligent Systems, 2025 Handling class imbalance is a fundamental challenge in supervised learning, particularly in real‐world scenarios where minority classes are critical yet underrepresented. This paper presents a novel dynamic‐balancing pipeline that enhances automated machine learning (AutoML) performance on imbalanced tabular datasets. The proposed approach integrates both traditional and generative resampling techniques with adaptive, class‐specific thresholds, enabling automated and dataset‐sensitive balancing strategies. To assess its generalizability, the pipeline is applied uniformly across binary, multiclass, and multilabel classification tasks. Each configuration is evaluated within an AutoML framework using performance and efficiency metrics, with outcomes validated through statistical testing and effect size analysis. The study also incorporates dataset complexity measures—including feature‐label dependency and class overlap—to investigate how structural characteristics affect balancing efficacy. By combining principled resampling, exhaustive grid search, and rigorous evaluation, the pipeline enables more robust and efficient AutoML workflows. This work contributes a flexible and reproducible framework for addressing class imbalance, particularly in multilabel contexts, and establishes a foundation for scalable, complexity‐aware resampling in automated model development.
Interactive Control System for Automated Guide Vehicles João Paulo Carvalho Henriques, Daniel Nunes Teixeira, Matheus Brandani Mendes Rosa, Miguel José Abdala Ribeiro, Egídio Raimundo Neto, Marcelo Vinicius Cysneiros Aragão, João Pedro Magalhães de Paula Paiva 2025 13th International Conference on Control, Mechatronics and Automation (ICCMA 2025), 2025 This work presents a new mapping methodology for AGV (Automated Guided Vehicle) systems. The goal is to achieve vehicle positioning through a PID (Proportional, Integral, and Derivative) controller, without requiring physical interaction with the path to be followed. The article gives a broad overview of the data processing and application techniques involved in achieving accurate positioning without external feedback. The results are satisfactory, although some sensor limitations must be considered.
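For readers unfamiliar with PID control, the sketch below shows a textbook discrete PID loop driving a one-dimensional position toward a target. It does not reproduce the controller or the vehicle model from the paper: the gains, the time step, and the idealized plant (commanded velocity integrated directly into position) are assumptions chosen only for illustration.

```python
class PID:
    """Textbook discrete PID controller (illustrative gains, not from the paper)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement, dt):
        error = setpoint - measurement
        self.integral += error * dt
        # No derivative action on the first step, when there is no previous error.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(target=1.0, steps=400, dt=0.05):
    """Drive a hypothetical 1-D vehicle whose velocity equals the controller output."""
    pid = PID(kp=2.0, ki=0.5, kd=0.1)
    position = 0.0
    for _ in range(steps):
        velocity = pid.update(target, position, dt)
        position += velocity * dt  # idealized plant: simple integrator
    return position


final_position = simulate()
```

In this idealized setting the position settles near the target; a real AGV would add actuator limits, sensor noise, and a position estimate in place of the exact `position` value used here.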
A Study and Evaluation of Classifiers for Anti-Spam Systems Marcelo V. C. Aragao, Isaac C. Ferreira, Edvard M. Oliveira, Bruno T. Kuehne, Edmilson M. Moreira, Otavio A. S. Carpinteiro IEEE Access, 2021 The volume of e-mails has been increasing in recent years. However, since 2005, at least half of these e-mails have been spam. This massive traffic of unwanted messages causes losses to users, such as excessive and unnecessary use of network bandwidth, loss of productivity, and exposure of inappropriate content to inappropriate audiences. This paper proposes the study and application of machine learning models to the classification of e-mails in existing anti-spam systems and, in particular, in the new anti-spam system Open-MaLBAS. After many experiments on different data sets, it was possible both to demonstrate the feasibility of the proposal and to develop a powerful combination of techniques, methods, and models that can be successfully applied to the classification of e-mails in anti-spam systems.
The Development of the Open Machine-Learning-Based Anti-Spam (Open-MaLBAS) Isaac C. Ferreira, Marcelo V. C. Aragão, Edvard M. Oliveira, Bruno T. Kuehne, Edmilson M. Moreira, Otávio A. S. Carpinteiro IEEE Access, 2021 Spam e-mails are unsolicited e-mails received by users of the e-mail service. Spam e-mails cause serious harm to organizations, for they waste, among other things, their computational and networking resources. To reduce the damage caused by them, organizations use anti-spams. Anti-spams are software systems that classify e-mails in order to separate legitimate from spam e-mails. The best current commercial and open-source anti-spams, and in particular the well-known commercial anti-spam CanIt-PRO, make use of various techniques, such as blacklists and/or SMTP extensions, to classify e-mails. Unfortunately, both blacklists and SMTP extensions have serious drawbacks, such as low scalability and high computational and network costs. This paper introduces the Open Machine-Learning-Based Anti-Spam (Open-MaLBAS). Unlike the best current anti-spams, Open-MaLBAS does not make use of blacklists and SMTP extensions, but only of machine learning models for e-mail classification. Open-MaLBAS was compared to CanIt-PRO in a series of experiments on a database composed of 862,227 real e-mails, collected over three months at the Federal University of Itajubá, Brazil. The e-mails were previously classified by CanIt-PRO. In the experiments, Open-MaLBAS correctly classified 81.48% and 98.13% of the e-mails in the database using the two models evaluated, Multi-Layer Perceptron and Random Forest, respectively. In addition, it classified all e-mails in the database in up to 88% less time than CanIt-PRO. Open-MaLBAS is implemented in Java and released under a free software license. It is available on GitHub.
Large‐Scale Benchmarking of Intrusion Detection Datasets With GPU‐Accelerated Data Pipelines, Complexity Analysis, and Model Evaluation MVC Aragão, FAP Figueiredo, SB Mafra International Journal of Intelligent Systems 2026 (1), 9925751, 2026
Interactive Control System for Automated Guide Vehicles JPC Henriques, DN Teixeira, MBM Rosa, MJA Ribeiro, ER Neto, ... 2025 13th International Conference on Control, Mechatronics and Automation …, 2025
A practical evaluation of AutoML tools for binary, multiclass, and multilabel classification MVC Aragão, AG Afonso, RC Ferraz, RG Ferreira, SG Leite, ... Scientific Reports 15 (1), 17682, 2025 Citations: 21
A Sample-Based, Multi-Stage Machine Learning Pipeline for Scalable IoT Threat Detection MVC Aragão, TM Pereira, FAP de Figueiredo, SB Mafra IEEE Embedded Systems Letters, 2025 Citations: 4
Dynamic‐Balancing AutoML for Imbalanced Tabular Data With Adaptive Resampling and Complexity‐Aware Analysis MVC Aragão, TM Pereira, MF Carvalho, FAP Figueiredo, SB Mafra International Journal of Intelligent Systems 2025 (1), 3986105, 2025 Citations: 2
Enhancing AutoML performance for imbalanced tabular data classification: A self-balancing pipeline MVC Aragão, M de Freitas Carvalho, T de Morais Pereira, ... 2024 Citations: 3
ML-based novelty detection and classification of security threats in IoT networks MVC Aragão, GP Ambrósio, FAP de Figueiredo presented at the Simpósio Bras. Telecomun. Process. Sinais, São José dos …, 2023 Citations: 2
Análise de tráfego de rede com machine learning para identificaçao de ameaças a dispositivos IoT MVC Aragão, S Mafra, FAP de Figueiredo Proceedings of the 40th Brazilian Symposium on Telecommunications and Signal …, 2022 Citations: 4
A study and evaluation of classifiers for anti-spam systems MVC Aragao, IC Ferreira, EM Oliveira, BT Kuehne, EM Moreira, ... IEEE Access 9, 157482-157498, 2021 Citations: 3
The development of the open machine-learning-based anti-spam (open-malbas) IC Ferreira, MVC Aragão, EM Oliveira, BT Kuehne, EM Moreira, ... IEEE Access 9, 138618-138632, 2021 Citations: 6
Factorial design analysis applied to the performance of SMS anti-spam filtering systems MVC Aragao, EP Frigieri, CA Ynoguti, AP Paiva Expert Systems with Applications 64, 589-604, 2016 Citations: 24
Otimizando o treinamento ea topologia de um decodificador de canal baseado em redes neurais MVC Aragão, SB Mafra, FAP de Figueiredo Polar 2, 1 Citations: 2