Computer Vision and Pattern Recognition, Artificial Intelligence
Scopus Publications: 61
Scholar Citations: 6911
Scholar h-index: 35
Scholar i10-index: 58
Scopus Publications
ST-GCVA: Hierarchical graph-based spatio-temporal reasoning for robust violence detection Mustaqeem Khan, Jamil Alkilani, Wasim Abusafia, Jamil Ahmad, Farman Ullah, Naqqash Dilshad Intelligent Systems with Applications, 2026 Detecting violent activities in unconstrained video streams remains a critical yet challenging task for intelligent surveillance systems. While two-stream architectures leveraging RGB and Optical Flow have shown promise, existing approaches often rely on sequential temporal modeling and fixed spatial attention mechanisms, limiting their ability to capture localized violent interactions and long-range temporal dependencies. In this work, we propose ST-GCVA, a hierarchically structured spatio-temporal framework in which video dynamics are modeled at multiple levels, combining local temporal encoding for short-term motion patterns with graph-based global temporal reasoning for long-range frame dependencies within a dual-stream architecture. Unlike conventional models that treat temporal modeling sequentially, we decouple temporal reasoning into two complementary stages: (1) local temporal encoding via Temporal Convolutional Networks (TCNs) to capture short-term motion dynamics, and (2) global relational modeling via Graph Convolutional Networks (GCNs), where video frames are represented as graph nodes to enable non-sequential long-range dependency propagation. Furthermore, we incorporate deformable multi-head attention to dynamically localize salient violent interaction regions, improving robustness in cluttered and crowded environments. This unified spatial–temporal-relational optimization allows the proposed framework to reason jointly about where violence occurs, how it evolves, and which frame relationships are most discriminative. Extensive experiments on RWF-2000, Hockey-Fight, and Movies-Fight benchmarks demonstrate that ST-GCVA achieves state-of-the-art performance while maintaining computational efficiency with only 8.95M parameters, 12.3 GFLOPs, and real-time inference at 45 FPS. Comprehensive ablation studies validate the necessity and synergistic interaction of the proposed components. The results highlight the effectiveness of hierarchical temporal modeling and graph-based reasoning for structured violence understanding on established benchmarks. Preliminary evaluation under challenging real-world conditions reveals performance limitations that motivate future work on domain adaptation and multi-modal robustness.
• Proposes a hierarchical TCN–GCN framework for structured spatio-temporal violence reasoning.
• Introduces task-adaptive deformable attention for localized modeling of violent interactions.
• Reformulates temporal modeling as graph-based relational reasoning beyond sequential LSTMs.
• Unifies spatial localization and global temporal dependency modeling in a dual-stream architecture.
• Achieves state-of-the-art performance with only 8.95M parameters on benchmark datasets.
AI-Driven Digital Twin Models for Wireless Endoscopic Gastrointestinal Monitoring Mustaqeem Khan, Jamil Ahmad, Yasir Mahmood 2026 7th International Conference on Advancements in Computational Sciences ICACS 2026, 2026 Digital twins integrate analytics and human expertise to create models that support informed healthcare decisions. In resource-limited areas, access to medical care is often limited by a lack of facilities, skilled professionals, and transportation. Wireless capsule endoscopy (WCE), being portable, reliable, and easy to use, has emerged as a superior alternative to traditional endoscopy, particularly for patients in remote areas. However, WCE generates large volumes of video data, requiring significant computation to analyze and extract relevant information. To address this challenge, we propose a video summarization scheme that identifies key frames while removing redundant content, preserving essential diagnostic information. Furthermore, we introduce a deep learning-based digital gastroenterologist twin for the automated classification of stomach-related pathological findings. The system reduces storage requirements without compromising critical information, enabling gastroenterologists to provide remote support. Experimental results demonstrate a performance improvement of at least 3% compared to state-of-the-art deep learning techniques, highlighting its potential for intelligent, scalable, and efficient gastroenterology care.
Distilling Knowledge to Efficient Transformer for Semi-Supervised Citrus Maturity Detection using Consumer UAVs Jamil Ahmad, Mustaqeem Khan, Wail Gueaieb, Abdulmotaleb El Saddik, Giulia De Masi, Fakhri Karray IEEE Transactions on Consumer Electronics, 2026 Accurate detection of citrus fruit maturity is critical for optimizing harvest schedules and maximizing yield. Consumer-grade unmanned aerial vehicles (UAVs) have emerged as cost-effective alternatives to traditional methods for detecting maturity, which rely on labor-intensive manual inspections. This paper presents a two-step, semi-supervised approach leveraging knowledge distillation (KD) and transfer learning for citrus maturity detection in UAV images. Specifically, we combine teacher-filtered pseudo-labels with a consistency-guided feature distillation signal to exploit abundant unlabeled UAV frames while using only a small labeled seed set. In the first step, consistency-guided KD transfers knowledge from a pretrained detection transformer with collaborative hybrid assignment training (Co-DETR) to a lightweight student network by exploiting a small labeled dataset and a large unlabeled dataset. The student network (Cit-DETR) is based on the highly efficient detection transformer (RT-DETR), with a ResNet18 backbone, selective kernel blocks, and the hybrid encoder module. In the second step, a small, augmented dataset with maturity labels is used to fine-tune the Cit-DETR model for maturity detection. Experimental results on a custom UAV-captured citrus dataset demonstrate the effectiveness of our method, achieving 86.2% average precision in citrus detection and 91.0% mean average precision in ripeness detection. The model has been further optimized for real-time inference on edge devices or UAVs, enabling precision agriculture applications.
AG-CLIP: Attribute-Guided CLIP for Zero-Shot Fine-Grained Recognition Jamil Ahmad, Mustaqeem Khan, Wail Guiaeab, Abdulmotaleb Elsaddik, Giulia De Masi, Fakhri Karray IEEE Open Journal of the Computer Society, 2026 Zero-shot fine-grained recognition is challenging due to the high visual similarity between classes and the inferior encoding of fine-grained features in embedding models. In this work, we present an attribute-guided Contrastive Language-Image Pre-training (AG-CLIP) model with an additional attribute encoder. Our approach first identifies relevant visual attributes from the textual class descriptions using an attribute mining module that leverages a large language model (LLM), GPT-4o. The attributes are then used to construct prompts for an open-vocabulary object/region detector to extract the corresponding image regions. The attribute text, along with focused regions of the input, then guides the CLIP model to focus on these discriminative attributes during fine-tuning through a context-attribute fusion module. Our attribute-guided attention mechanism allows CLIP to effectively disambiguate fine-grained classes by highlighting their distinctive attributes without requiring fine-tuning or additional training data on unseen classes. We evaluate our approach on the CUB-200-2011 and plant disease datasets, achieving 73.3% and 84.6% accuracy, respectively. Our method achieves state-of-the-art zero-shot performance, outperforming prior methods that rely on external knowledge bases or complex meta-learning strategies. The strong results demonstrate the effectiveness of injecting generic attribute awareness into powerful vision-language models like CLIP for tackling fine-grained recognition in a zero-shot manner.
ViolenceNet: Multi-Scale Transformer with Joint Features Understanding Mustaqeem Khan, Wasim Abusafia, Jamil Alkilani, Mazin Mohamed, Amar Ez-Zyn, Hasan Zokrait, Jamil Ahmad Digest of Technical Papers IEEE International Conference on Consumer Electronics, 2026 Public safety increasingly depends on systems that can detect and respond to trouble as it occurs. Yet, many current models fail to fully integrate what the video shows, how things move, and what the audio sounds like, especially in critical situations. We introduce a multimodal framework that brings together visual, motion, and audio signals into a single, learned representation. Two components are central to this work: a gated fusion module that adjusts the importance of each modality at any given moment, and a multi-scale transformer that captures the timing and context of events. This tighter integration leads to a clearer understanding of rapidly changing scenes and more informed decisions. On the XD-Violence benchmark, the method achieves an average precision of 84.85% based on frame-by-frame analysis, surpassing prior state-of-the-art results.
Unleashing Creativity in the Metaverse: Generative AI and Multimodal Content Abdulmotaleb El Saddik, Jamil Ahmad, Mustaqeem Khan, Saad Abouzahir, Wail Gueaieb ACM Transactions on Multimedia Computing Communications and Applications, 2025 The metaverse is an emerging frontier for creative expression and collaboration, where generative artificial intelligence (GenAI) can play a pivotal role with its ability to generate multimodal content from simple prompts. These prompts allow the metaverse to interact with GenAI, where context information, instructions, input data, or even output indications constituting the prompt can come from within the metaverse. However, their integration poses challenges regarding interoperability, lack of standards, scalability, and maintaining a high-quality user experience. This article explores how GenAI can productively assist in enhancing creativity within the contexts of the metaverse and unlock new opportunities. We provide a technical, in-depth overview of the different generative models for image, video, audio, and 3D content within the metaverse environments. We also explore the bottlenecks, opportunities, and innovative applications of GenAI from the perspectives of end users, developers, service providers, and AI researchers. This survey commences by highlighting the potential of GenAI for enhancing the metaverse experience through dynamic content generation to populate massive virtual worlds. Subsequently, we shed light on the ongoing research practices and trends in multimodal content generation, enhancing realism and creativity and alleviating bottlenecks related to standardization, computational cost, privacy, and safety. Last, we share insights into promising research directions toward the integration of GenAI with the metaverse for creative enhancement, improved immersion, and innovative interactive applications.
Residual-Enhanced YOLO with Motion-Blur Augmentation for UAV-Based Weed Detection Jamil Ahmad, Sara Mahmoud, Mariam Mahmoud, Lama Elfateh, Mustaqeem Khan 19th IEEE International Conference on Application of Information and Communication Technologies AICT 2025 Conference Proceedings, 2025 Unmanned Aerial Vehicles (UAVs) combined with deep learning models have transformed weed detection in precision agriculture, yet challenges remain in adapting to field variability, limited datasets, and motion blur from UAV flight. While CNNs and Transformer-based methods achieve high accuracy, their deployment often struggles with real-time constraints. YOLO architectures, by contrast, offer a strong balance between inference speed and detection accuracy, making them suitable for in-field use. In this paper, we propose YOLO-Res, a residual-skip enhanced YOLO model tailored for UAV-based weed detection in maize fields under true field conditions. The model incorporates motion-blur augmentation during training to simulate UAV-induced artifacts and leverages residual connections to strengthen feature propagation across scales, thereby improving robustness to degraded inputs. Evaluations on the WeedsGalore dataset demonstrate that YOLO-Res consistently outperforms baseline YOLO variants, achieving superior detection accuracy even under severe blur and varying crop–weed densities, while maintaining real-time inference capability.
Joint Multi-Scale Multimodal Transformer for Emotion Using Consumer Devices Mustaqeem Khan, Jamil Ahmad, Wail Gueaieb, Giulia De Masi, Fakhri Karray, Abdulmotaleb El Saddik IEEE Transactions on Consumer Electronics, 2025 The field of Multimodal Emotion Recognition (MER) has made considerable advancements in recent years; however, the opportunity to leverage the synergistic relationships between different modalities remains largely untapped. This paper introduces an MER approach employing a Joint Multi-Scale Multimodal Transformer (JMMT) with recursive cross-attention for naturalistic recognition of emotions by capturing and enhancing inter- and intra-modal relationships across the visual and audio modalities. We compute multi-scale attention weights based on cross-correlations between multi-scale joint representations of combined and individual cues to capture inter- and intra-modal dynamics. The outputs of the individual modalities are fed back recursively during fusion to further refine the features. Our JMMT model presents a cost-effective solution for consumer devices by capturing synergistic characteristics across visual and audio inputs. The JMMT model outperforms state-of-the-art (SOTA) MER methods, as evaluated on the IEMOCAP and MELD datasets.
Context-Aware Detection and Grading of Intracranial Aneurysms in DSA Images Jamil Ahmad, Khalid Malik, Farman Ullah, Kaleem Ahmad, Mustaqeem Khan, Ghaus Malik 2025 7th Computing Communications and IoT Applications Conference ComComAp 2025, 2025 Tiny aneurysms in digital subtraction angiography (DSA) are challenging to detect and classify due to their low contrast, complex vascular backgrounds, and small size. We propose a three-stage, context-aware framework for automated aneurysm analysis. First, aneurysm candidates are localized using an RT-DETR detector, which provides accurate anchor-free detection and robust sensitivity to small objects. Second, the detected regions, including the aneurysm patch, are expanded to include the surrounding vessel context and are enhanced using a pretrained RealESRGAN model to restore vascular detail and improve geometric fidelity without additional training. Third, each super-resolved patch is passed to a modified ResNet50 classifier for categorical severity grading (mild, moderate, severe, critical). The classifier integrates local and contextual cues to better capture risk-related morphology beyond lesion size alone. The framework is evaluated on a 2D DSA dataset annotated with bounding boxes and severity labels, reporting both detection and multi-class classification performance. Results show that combining RealESRGAN-based enhancement and context-aware classification yields substantial gains in accuracy and recall, enabling more reliable and clinically interpretable aneurysm assessment.
ST-GCVA: Hierarchical graph-based spatio-temporal reasoning for robust violence detection M Khan, J Alkilani, W Abusafia, J Ahmad, F Ullah, N Dilshad Intelligent Systems with Applications, 200676, 2026
FocusSDF: Boundary-Aware Learning for Medical Image Segmentation via Signed Distance Supervision M Shafique, N Rahim, J Ahmad, MR Siadat, K Malik, G Malik 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI), 1-5, 2026. Citations: 1
AI-Driven Digital Twin Models for Wireless Endoscopic Gastrointestinal Monitoring M Khan, J Ahmad, Y Mahmood 2026 7th International Conference on Advancements in Computational Sciences …, 2026
ViolenceNet: Multi-Scale Transformer with Joint Features Understanding M Khan, W Abusafia, J Alkilani, M Mohamed, A Ez-Zyn, H Zokrait, ... 2026 IEEE International Conference on Consumer Electronics (ICCE), 1-6, 2026
Distilling Knowledge to Efficient Transformer for Semi-Supervised Citrus Maturity Detection using Consumer UAVs J Ahmad, M Khan, W Gueaieb, A El Saddik, G De Masi, F Karray IEEE Transactions on Consumer Electronics, 2026. Citations: 1
AG-CLIP: Attribute-guided CLIP for Zero-shot Fine-grained Recognition J Ahmad, M Khan, W Guiaeab, A Elsaddik, G De Masi, F Karray IEEE Open Journal of the Computer Society, 2026. Citations: 1
Leveraging model explainability and fine-grained cutmix augmentation for robust detection of apricot diseases in UAV images J Ahmad, W Gueaieb, A El Saddik, G De Masi, F Karray Expert Systems with Applications 296, 128946, 2026. Citations: 3
Context-Aware Detection and Grading of Intracranial Aneurysms in DSA Images J Ahmad, K Malik, F Ullah, K Ahmad, M Khan, G Malik 2025 Computing, Communications and IoT Applications (ComComAp), 324-329, 2025
Residual-Enhanced YOLO with Motion-Blur Augmentation for UAV-Based Weed Detection J Ahmad, S Mahmoud, M Mahmoud, L Elfateh, M Khan 2025 IEEE 19th International Conference on Application of Information and …, 2025
Unleashing Creativity in the Metaverse: Generative AI and Multimodal Content A El Saddik, J Ahmad, M Khan, S Abouzahir, W Gueaieb ACM Transactions on Multimedia Computing, Communications and Applications 21 …, 2025. Citations: 37
Joint multi-scale multimodal transformer for emotion using consumer devices M Khan, J Ahmad, W Gueaieb, G De Masi, F Karray, A El Saddik IEEE Transactions on Consumer Electronics 71 (1), 1092-1101, 2025. Citations: 31
Knowledge-infused learning for fine-grained plant disease recognition J Ahmad, W Gueaieb, A El Saddik, G De Masi, F Karray 2024 IEEE International Conference on Image Processing (ICIP), 395-401, 2024. Citations: 1
Yield estimation and health assessment of temperate fruits: A modular framework J Ahmad, W Gueaieb, A El Saddik, G De Masi, F Karray Engineering Applications of Artificial Intelligence 136, 108871, 2024. Citations: 12
Artificial Intelligence-based intrusion detection system for V2V communication in vehicular adhoc networks A Khalil, H Farman, MM Nasralla, B Jan, J Ahmad Ain Shams Engineering Journal 15 (4), 102616, 2024. Citations: 44
Enabling consumer UAVs for precision agriculture applications: A case study of yield estimation J Ahmad, W Gueaieb, A El Saddik, G De Masi, F Karray 2024 IEEE International Conference on Consumer Electronics (ICCE), 1-6, 2024. Citations: 5
Skin-former: mobile-friendly transformer for skin lesion diagnosis M Khan, J Ahmad, A El Saddik, W Gueaieb 2024 IEEE International Conference on Consumer Electronics (ICCE), 1-6, 2024. Citations: 14
Drone-HAT: Hybrid attention transformer for complex action recognition in drone surveillance videos M Khan, J Ahmad, A El Saddik, W Gueaieb, G De Masi, F Karray Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2024. Citations: 32
StrokeNet: An automated approach for segmentation and rupture risk prediction of intracranial aneurysm M Irfan, KM Malik, J Ahmad, G Malik Computerized Medical Imaging and Graphics 108, 102271, 2023. Citations: 22
Prognosis prediction in COVID-19 patients through deep feature space reasoning J Ahmad, AKJ Saudagar, KM Malik, MB Khan, A AlTameem, ... Diagnostics 13 (8), 1387, 2023. Citations: 4
Enabling automation and edge intelligence over resource constraint IoT devices for smart home M Nasir, K Muhammad, A Ullah, J Ahmad, SW Baik, M Sajjad Neurocomputing 491, 494-506, 2022. Citations: 95
MOST CITED SCHOLAR PUBLICATIONS
Action recognition in video sequences using deep bi-directional LSTM with CNN features A Ullah, J Ahmad, K Muhammad, M Sajjad, SW Baik IEEE Access 6, 1155-1166, 2017. Citations: 986
Convolutional neural networks based fire detection in surveillance videos K Muhammad, J Ahmad, I Mehmood, S Rho, SW Baik IEEE Access 6, 18174-18183, 2018. Citations: 702
Efficient deep CNN-based fire detection and localization in video surveillance applications K Muhammad, J Ahmad, Z Lv, P Bellavista, P Yang, SW Baik IEEE Transactions on Systems, Man, and Cybernetics: Systems 49 (7), 1419-1434, 2018. Citations: 658
Early fire detection using convolutional neural networks during surveillance for effective disaster management K Muhammad, J Ahmad, SW Baik Neurocomputing 288, 30-42, 2018. Citations: 640
Speech emotion recognition from spectrograms with deep convolutional neural network AM Badshah, J Ahmad, N Rahim, SW Baik 2017 International Conference on Platform Technology and Service (PlatCon), 1-5, 2017. Citations: 588
Secure surveillance framework for IoT systems using probabilistic image encryption K Muhammad, R Hamza, J Ahmad, J Lloret, H Wang, SW Baik IEEE Transactions on Industrial Informatics 14 (8), 3679-3689, 2018. Citations: 347
Deep learning methods and applications J Ahmad, H Farman, Z Jan Deep Learning: Convergence to Big Data Analytics, 31-42, 2019. Citations: 265
Attention induced multi-head convolutional neural network for human activity recognition ZN Khan, J Ahmad Applied Soft Computing 110, 107671, 2021. Citations: 236
Deep features-based speech emotion recognition for smart affective services AM Badshah, N Rahim, N Ullah, J Ahmad, K Muhammad, MY Lee, ... Multimedia Tools and Applications 78 (5), 5571-5589, 2019. Citations: 230
CISSKA-LSB: color image steganography using stego key-directed adaptive LSB substitution method K Muhammad, J Ahmad, NU Rehman, Z Jan, M Sajjad Multimedia Tools and Applications 76 (6), 8597-8626, 2017. Citations: 168
A Secure Method for Color Image Steganography using Gray-Level Modification and Multi-level Encryption K Muhammad, J Ahmad, ZJ Haleem Farman, M Sajjad, SW Baik KSII Transactions on Internet and Information Systems 9 (5), 1938-1962, 2015. Citations: 113
Internet of energy: Opportunities, applications, architectures and challenges in smart industries Y Shahzad, H Javed, H Farman, J Ahmad, B Jan, M Zubair Computers & Electrical Engineering 86, 106739, 2020. Citations: 108
Visual features based boosted classification of weeds for real-time selective herbicide sprayer systems J Ahmad, K Muhammad, I Ahmad, W Ahmad, ML Smith, LN Smith, ... Computers in Industry 98, 23-33, 2018. Citations: 99
Enabling automation and edge intelligence over resource constraint IoT devices for smart home M Nasir, K Muhammad, A Ullah, J Ahmad, SW Baik, M Sajjad Neurocomputing 491, 494-506, 2022. Citations: 95
Image steganography for authenticity of visual contents in social networks K Muhammad, J Ahmad, S Rho, SW Baik Multimedia Tools and Applications 76 (18), 18985-19004, 2017. Citations: 90
Disease detection in plum using convolutional neural network under true field conditions J Ahmad, B Jan, H Farman, W Ahmad, A Ullah Sensors 20 (19), 5569, 2020. Citations: 84
Analytical network process based optimum cluster head selection in wireless sensor network H Farman, H Javed, B Jan, J Ahmad, S Ali, FN Khalil, M Khan PLoS One 12 (7), e0180848, 2017. Citations: 68
Medical image retrieval with compact binary codes generated in frequency domain using highly reactive convolutional features J Ahmad, K Muhammad, SW Baik Journal of Medical Systems 42 (2), 24, 2018. Citations: 66
A novel image steganographic approach for hiding text in color images using HSI color model K Muhammad, J Ahmad, H Farman, M Zubair arXiv preprint arXiv:1503.00388, 2015. Citations: 66
Endoscopic image classification and retrieval using clustered convolutional features J Ahmad, K Muhammad, MY Lee, SW Baik Journal of Medical Systems 41 (12), 196, 2017. Citations: 64