Negation and Speculation in NLP: A Survey, Corpora, Methods, and Applications
Ahmed Mahany, Heba Khaled, Nouh Sabri Elmitwally, Naif Aljohani, Said Ghoniemy
Applied Sciences (Switzerland), 2022
Negation and speculation are universal linguistic phenomena that affect the performance of Natural Language Processing (NLP) applications, such as those for opinion mining and information retrieval, especially in biomedical data. In this article, we review the corpora annotated with negation and speculation in various natural languages and domains. Furthermore, we discuss ongoing research into recent rule-based, supervised, and transfer learning techniques for the detection of negating and speculative content. Many English corpora for various domains are now annotated with negation and speculation; moreover, the availability of annotated corpora in other languages has started to increase. However, this growth is insufficient to address these important phenomena in languages with limited resources. The use of cross-lingual models and translation from well-known languages are acceptable alternatives. We also highlight the lack of consistent annotation guidelines and the shortcomings of the existing techniques, and suggest alternatives that may speed up progress in this research direction. Adding more syntactic features may alleviate the limitations of the existing techniques, such as cue ambiguity and the detection of discontinuous scopes. In some NLP applications, the inclusion of a negation- and speculation-aware system improves performance, yet this aspect is still not addressed or considered an essential step.

Annotated Corpus with Negation and Speculation in Arabic Review Domain: NSAR
Ahmed Mahany, Heba Khaled, Nouh Sabri Elmitwally, Naif Aljohani, Said Ghoniemy
International Journal of Advanced Computer Science and Applications, 2022
Negation and speculation detection are critical for Natural Language Processing (NLP) tasks, such as sentiment analysis, information retrieval, and machine translation. This paper presents the first Arabic corpus in the review domain annotated with negation and speculation. The Negation and Speculation Arabic Review (NSAR) corpus consists of 3K randomly selected review sentences from three well-known and benchmarked Arabic corpora. It contains reviews from different categories, including books, hotels, restaurants, and other products, written in various Arabic dialects. The negation and speculation keywords have been annotated along with their linguistic scope, based on annotation guidelines reviewed by an expert linguist. The inter-annotator agreement between two independent annotators, both native Arabic speakers, is measured using Cohen’s Kappa coefficient, with values of 95 and 80 for negation and speculation, respectively. Furthermore, 29% of this corpus includes at least one negation instance, while only 4% contains speculative content. Therefore, the Arabic reviews focus more on negation structures than on speculation. This corpus will be available for the Arabic research community to handle these critical phenomena.

Supervised Learning for Negation Scope Detection in Arabic Texts
Ahmed Mahany, Mohamed M. Fouad, Afnan Aloraini, Heba Khaled, Raheel Nawaz, et al.
Proceedings, 2021 IEEE 10th International Conference on Intelligent Computing and Information Systems (ICICIS 2021), 2021
Research on negation detection in Arabic is limited due to the unavailability of Arabic corpora targeting this phenomenon. Negation detection affects several subfields of Arabic Natural Language Processing (ANLP), including sentiment analysis and medical information retrieval. Therefore, a corpus is manually annotated with negation to address this deficiency in Modern Standard Arabic (MSA) and Classical Arabic (CA) texts. This corpus is collected from various sources, including the King Saud University Corpus of Classical Arabic (KSUCCA) and Wikipedia. It includes texts on various topics, such as religion, sports, science, biography, health, technology, education, and history. In addition, we propose a supervised learning system for the problem of negation scope detection in Arabic texts. Our system depends on Word2Vec and FastText word embeddings with two different classifiers: the Bidirectional Long Short-Term Memory (BiLSTM) and the Support Vector Machine (SVM) as a baseline system. The results show that one of the FastText-BiLSTM based models achieved a classification accuracy of 93% with an F1 score of 89%. These encouraging results further suggest that the treatment of the negation phenomenon is tractable.

ArWordVec: efficient word embedding models for Arabic tweets
Mohammed M. Fouad, Ahmed Mahany, Naif Aljohani, Rabeeh Ayaz Abbasi, Saeed-Ul Hassan
Soft Computing, 2020
One of the major advances in artificial intelligence today is the ability to understand, process, and utilize human natural language. This has been achieved by employing natural language processing (NLP) techniques together with various deep learning approaches and architectures. In recent years, distributed word representations have been used very effectively in place of the traditional bag-of-words approach for many NLP tasks. In this paper, we present the detailed steps of building a set of efficient word embedding models, called ArWordVec, that are generated from a huge repository of Arabic tweets. In addition, a new method for measuring Arabic word similarity is introduced and used to evaluate the performance of the generated ArWordVec models. The experimental results show that the ArWordVec models outperform the recently available models on Arabic Twitter data for the word similarity task. In addition, two large Arabic tweet datasets are used to examine the performance of the proposed models in the multi-class sentiment analysis task. The results show that the proposed models are very efficient and help in achieving a classification accuracy exceeding 73.86% with a high average F1 value of 74.15.

Masdar: A novel sequence-to-sequence deep learning model for Arabic stemming
Mohammed M. Fouad, Ahmed Mahany, Iyad Katib
Advances in Intelligent Systems and Computing, 2020
Preprocessing the input textual data is the main starting step in any Natural Language Processing (NLP) application. Word stemming, i.e., extracting the stem or root of the input word, is a vital process within the preprocessing step. In this process, words like “player”, “playing”, and “played” are mapped to their stem “play”. In the English language, there are several algorithms and approaches that can be applied directly to handle this process. On the other hand, there have been some attempts at similar algorithms for Arabic, but all have weak performance due to the complexity of the language and the approaches used for building such algorithms. In this paper, we present a novel deep learning-based model, called Masdar, for Arabic stemming. The proposed model leverages the power of deep learning, especially recurrent neural networks, to build an efficient Arabic stemmer that is capable of producing very accurate stems for most input words. Experiments are conducted to compare the performance of the proposed model with the latest cited Arabic stemmers on a dataset of about 6000 Arabic word/stem pairs. The experimental results show that Masdar outperformed the other stemmers. It can efficiently produce the correct stems with about 95% accuracy on the whole dataset and about 82% accuracy on the unseen test words.
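
To make the stemming task and its evaluation concrete, the following is a minimal sketch of how a stemmer could be scored on word/stem pairs, using the English example from the Masdar abstract. It is not the Masdar model: the suffix list, the function names naive_stem and stem_accuracy, and the sample pairs are all hypothetical and chosen only for illustration. A simple suffix-stripping rule stands in for the sequence-to-sequence network.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical suffix list for illustration only; it is not taken from the paper.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_accuracy(
    stemmer: Callable[[str], str],
    pairs: Iterable[Tuple[str, str]],
) -> float:
    """Return the fraction of (word, gold_stem) pairs the stemmer gets exactly right."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one word/stem pair is required")
    correct = sum(1 for word, gold in pairs if stemmer(word) == gold)
    return correct / len(pairs)


if __name__ == "__main__":
    # Assumed toy data: three pairs from the abstract plus one harder case.
    sample_pairs = [
        ("player", "play"),
        ("playing", "play"),
        ("played", "play"),
        ("running", "run"),  # the naive rule yields "runn", so this one is missed
    ]
    print(f"accuracy: {stem_accuracy(naive_stem, sample_pairs):.2f}")
```

Any stemmer exposed as a function from word to stem could be passed to stem_accuracy in the same way, which is roughly how accuracy figures on seen and unseen word/stem pairs can be compared across systems.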