@iem.edu.in
Assistant Professor, Computer Science and Engineering
Institute of Engineering and Management
M.Tech, B.E
Natural Language Processing
Sainik Kumar Mahata, Dipankar Das, and Sivaji Bandyopadhyay
Association for Computing Machinery (ACM)
Multilingualism is widespread in India owing to its long history of contact with foreign languages, which has produced an audience accustomed to conversing in more than one language. The social media boom has made such mixed-language communication even more extensive. Hence, a translation system that can serve novice and monolingual users is urgently needed. Such systems can be developed using statistical machine translation or neural machine translation, each of which has its own advantages and disadvantages. Moreover, the code-mixed parallel corpora needed to train such systems are not readily available. In the present work, we present two translation frameworks that leverage the individual strengths of these pre-existing approaches by building an ensemble model that takes a consensus of their final outputs and generates the target output. The developed models were used to translate English-Bengali code-mixed data (written in Roman script) into equivalent monolingual Bengali instances. A code-mixed to monolingual parallel corpus was also developed to train the underlying systems. Empirically, the two frameworks achieved BLEU/TER scores of 17.23/53.18 and 19.12/51.29, respectively.
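The ensemble idea above, taking a consensus of the outputs of multiple translation systems, can be sketched with a simple position-wise majority vote. All sentences, system names, and the voting rule here are illustrative stand-ins, not the paper's actual method or data.

```python
# Toy sketch of an ensemble "consensus" step over candidate translations
# (e.g., one from an SMT system, one from an NMT system). The voting
# scheme and example sentences are hypothetical.
from collections import Counter

def consensus(candidates):
    """Pick, position by position, the token most candidates agree on."""
    tokenized = [c.split() for c in candidates]
    length = max(len(t) for t in tokenized)
    output = []
    for i in range(length):
        votes = Counter(t[i] for t in tokenized if i < len(t))
        output.append(votes.most_common(1)[0][0])
    return " ".join(output)

smt_out = "ami school jabo na"    # hypothetical SMT output
nmt_out = "ami school jabo"       # hypothetical NMT output
rule_out = "ami college jabo na"  # hypothetical third candidate
print(consensus([smt_out, nmt_out, rule_out]))  # ami school jabo na
```

Real consensus decoding over translation hypotheses is considerably subtler (alignment, confusion networks), but the sketch conveys the core idea of merging systems' outputs token by token.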
Darothi Sarkar, Monalisa Dey, Sahini Das, Soham Bangal, Aniket Kar, Sainik Kumar Mahata, and Anupam Mondal
IEEE
Security has become the highest priority for people, organisations, and governments in a world that is growing more interconnected and complex. Automated video inspection and recognition systems have become a popular and effective means of strengthening security measures. Such technology-driven systems integrate image processing, artificial intelligence, and machine learning to identify specific objects, analyse their behaviour and movement, and recognise trends in order to detect unusual activities. This review paper explores and compares the different technologies used in automated video inspection and recognition systems, examines several approaches to automated video inspection, and discusses the real-world applications and challenges of such systems, providing a comprehensive grasp of this vital part of security and monitoring infrastructure.
Soumyajit Chowdhury, Pawan Shaw, and Sainik Kumar Mahata
IEEE
Paraphrasing is a challenging and important task in NLP, which involves generating alternative versions of a text that preserve its original meaning. Paraphrasing can benefit many NLP applications, such as summarization, translation, and sentiment analysis, by improving the diversity, quality, and readability of the generated text. However, paraphrasing is not a trivial task, as it requires a deep understanding of the semantics and syntax of the text, as well as the ability to produce fluent and natural language. Existing paraphrasing tools are often limited in their scope, accuracy, and flexibility, and cannot handle complex and diverse texts. In this project, we present Paraphraser, a novel and comprehensive software tool that can paraphrase any text using advanced natural language processing techniques, deep learning models, and rich language resources. Paraphraser can generate multiple paraphrased versions of a text, ranging from subtle rewording to significant revision, while maintaining the original meaning and style. Paraphraser can also optimize the paraphrased text for various purposes, such as avoiding plagiarism, ensuring message clarity, and reaching a wider audience. It is a versatile and reliable tool that can be used in various domains, such as content marketing, academic writing, and SEO optimization, enhancing the originality and attractiveness of textual content. We demonstrate the effectiveness and usefulness of Paraphraser through various experiments and evaluations, and show how it can improve the performance of various NLP tasks and applications.
Debjyoti Ghosh, Abhirup Mazumder, and Sainik Kumar Mahata
IEEE
How often do we come across paragraphs that contain important information but are too long to read? Most readers tend to skip over lengthy paragraphs at the cost of losing crucial information. This creates a gap in topics that would otherwise connect significant concepts into a meaningful learning experience, something we term a knowledge void. This report highlights the importance of summarization using its two main approaches, abstractive and extractive. We discuss the methods in detail, including the methodologies, architectures, and algorithms involved: the preprocessing of data, word embeddings, algorithms such as TextRank, sequence-to-sequence models built with LSTMs, the encoder-decoder architecture, and other advanced NLP techniques. We also evaluate our work using appropriate evaluation metrics. We experimented with different approaches, including unidirectional LSTMs, bidirectional LSTMs, a variety of tokenizers, and the incorporation of an attention layer, to obtain the model with optimal accuracy and consistency.
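The extractive side of the pipeline above can be illustrated with a minimal TextRank-style scorer: sentences are ranked by iterating a PageRank-like update over a sentence-similarity graph. The sentence splitter and word-overlap similarity here are deliberately simplistic stand-ins for the embeddings and proper NLP preprocessing a real system would use.

```python
# Minimal TextRank-style extractive summarizer (illustrative sketch).
# Similarity = word overlap normalized by sentence lengths; scores are
# refined by a PageRank-like power iteration with damping factor d.
import math
import re

def similarity(s1, s2):
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    overlap = len(w1 & w2)
    return overlap / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

def textrank(text, top_n=1, d=0.85, iters=50):
    sents = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    n = len(sents)
    scores = [1.0] * n
    for _ in range(iters):
        scores = [
            (1 - d) + d * sum(
                similarity(sents[i], sents[j]) * scores[j]
                for j in range(n) if j != i
            )
            for i in range(n)
        ]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sents[i] for i in ranked[:top_n]]

text = "Cats are nice. Dogs are nice. The sky is blue."
print(textrank(text, top_n=1))
```

Full TextRank also normalizes each edge by the sum of outgoing weights; the sketch omits that for brevity without changing the ranking on this example.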
Sainik Kumar Mahata, Avishek Garain, Dipankar Das, and Sivaji Bandyopadhyay
Springer Science and Business Media LLC
Sainik Kumar Mahata, Amrita Chandra, Dipankar Das, and Sivaji Bandyopadhyay
Springer Singapore
Sainik Kumar Mahata, Subhabrata Dutta, Dipankar Das, and Sivaji Bandyopadhyay
ACM
Translation systems require a huge amount of parallel data to produce quality translations, but acquiring such data for low-resource languages is difficult. To counter this, recent research has combined related languages through transfer learning to augment low-resource data. While the performance gain from transfer learning is apparent, we investigate the correlation between this gain and the position of the concerned languages within a language family. We further probe the relationship between the performance gain and the degree of vocabulary sharing between the concerned languages.
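One simple way to quantify the vocabulary sharing probed above is the Jaccard overlap between the token vocabularies of two corpora. The measure and the toy corpora below are illustrative assumptions, not the paper's exact metric or data.

```python
# Sketch of a vocabulary-sharing measure between two corpora:
# Jaccard overlap of their token vocabularies. Corpus contents
# are toy examples, not the languages studied in the paper.
def vocab_overlap(corpus_a, corpus_b):
    va = {tok for line in corpus_a for tok in line.split()}
    vb = {tok for line in corpus_b for tok in line.split()}
    return len(va & vb) / len(va | vb)

corpus_x = ["mera naam ram hai", "main ghar jata hoon"]  # toy corpus
corpus_y = ["mero naam ram ho", "ma ghar janchu"]        # toy corpus
print(round(vocab_overlap(corpus_x, corpus_y), 3))       # 0.25
```

Closely related languages within a family would be expected to score higher on such a measure than distant ones, which is the kind of correlate the abstract describes probing.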
Avishek Garain, Sainik Kumar Mahata, and Subhabrata Dutta
IEEE
This paper presents a method that applies natural language processing to normalize numeronyms so that they are understandable by humans. We address the problem with two approaches, viz., a semi-supervised approach and a supervised approach. For the semi-supervised approach, we make use of the Damerau-Levenshtein distance between words and then apply cosine similarity to select the normalized text, reaching greater accuracy on the problem. For the supervised approach, we use a deep learning architecture. Our approaches garner accuracy figures of 71% and 72% for Bengali and English, respectively, with the semi-supervised approach, and 89% with the supervised approach.
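The semi-supervised pipeline above can be sketched as follows: candidate words are ranked by Damerau-Levenshtein distance to the numeronym, with cosine similarity over character counts as a tie-breaker. The candidate lexicon, the example numeronym "gr8", and the tie-breaking rule are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: normalize a numeronym by Damerau-Levenshtein distance,
# breaking ties with cosine similarity over character-count vectors.
import math
from collections import Counter

def damerau_levenshtein(a, b):
    d = {}
    for i in range(len(a) + 1):
        d[i, 0] = i
    for j in range(len(b) + 1):
        d[0, j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i, j] = min(d[i, j], d[i - 2, j - 2] + 1)  # transposition
    return d[len(a), len(b)]

def char_cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

def normalize(numeronym, lexicon):
    return min(lexicon,
               key=lambda w: (damerau_levenshtein(numeronym, w),
                              -char_cosine(numeronym, w)))

lexicon = ["great", "grate", "green", "later"]  # hypothetical candidates
print(normalize("gr8", lexicon))
```

The supervised counterpart would replace this ranking with a learned sequence model, which the abstract reports performing better.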
Sainik Kumar Mahata, Sushnat Makhija, Ayushi Agnihotri, and Dipankar Das
Springer Singapore
Sainik Kumar Mahata, Soumil Mandal, Dipankar Das, and Sivaji Bandyopadhyay
ACM
The use of multilingualism among the new generation is widespread in the form of code-mixed data on social media, and therefore a robust translation system is required for catering to novice and monolingual users. In this work, we present a translation framework that uses a translation-transliteration strategy for translating code-mixed data into equivalent monolingual instances. One of the goals of this work is to translate a code-mixed source (written in Roman script) to a Bengali target (written in Bengali script), where the source may contain English along with transliterated Bengali. Finally, to convert the output to a more readable form, it is reordered using a target language model. The decisive advantage of the proposed framework is that it does not require a code-mixed to monolingual parallel corpus for training and decoding. On testing, the framework achieved BLEU and TER scores of 16.47 and 55.45, respectively. Since the proposed framework comprises various sub-modules, we dive deeper into the importance of each of them, analyze the errors, and finally discuss some improvement strategies.
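The tagging-and-routing idea behind such a translation-transliteration strategy can be sketched as below: each Roman-script token is tagged as English or transliterated Bengali via a wordlist, then sent to the matching back end. The wordlists and the two stub "back ends" are hypothetical stand-ins, not the paper's actual modules, and the reordering step is omitted.

```python
# Toy sketch of a translation-transliteration pipeline for code-mixed
# text. Lexicons and back-end stubs are hypothetical placeholders.
ENGLISH_WORDS = {"school", "college", "bus"}  # assumed English lexicon

def translate_en(token):      # stand-in for an English->Bengali MT call
    return {"school": "বিদ্যালয়", "bus": "বাস"}.get(token, token)

def transliterate_bn(token):  # stand-in for Roman->Bengali transliteration
    return {"ami": "আমি", "jabo": "যাব", "na": "না"}.get(token, token)

def process(sentence):
    out = []
    for tok in sentence.lower().split():
        if tok in ENGLISH_WORDS:
            out.append(translate_en(tok))   # route 1: translate English
        else:
            out.append(transliterate_bn(tok))  # route 2: transliterate
    return " ".join(out)

print(process("ami school jabo na"))
```

A real system would replace the wordlist check with a trained language-identification tagger and the dictionary stubs with full MT and transliteration models, then reorder the result with a target language model as the abstract describes.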
Sainik Kumar Mahata, Dipankar Das, and Sivaji Bandyopadhyay
Walter de Gruyter GmbH
Machine translation (MT) is the automatic translation of a source language into its target language by a computer system. In the current paper, we propose an approach that uses recurrent neural networks (RNNs) alongside traditional statistical MT (SMT). We compare the performance of the SMT phrase table to that of the proposed RNN and, in turn, improve the quality of the MT output. This work was done as part of the shared task of MTIL2017. We constructed the traditional MT model using the Moses toolkit and additionally enriched the language model using external data sets. Thereafter, we ranked the phrase tables using an RNN encoder-decoder module originally created as part of the GroundHog project of the LISA lab.
Nilanjan Dey, Monalisa Dey, Sainik Kumar Mahata, Achintya Das, and Sheli Sinha Chaudhuri
Inderscience Publishers
The current globalised era is marked by a rapid increase in the use of wireless media to exchange information across globally distributed locations. This growth of technologically mediated information makes it possible to provide medical care from a distance by exchanging biomedical information among hospitals and diagnostic centres across the world. However, during transmission, medical information becomes highly vulnerable to attacks such as tampering and hacking. A watermark is added to the electrocardiographic (ECG) signal to increase the level of security, protect the integrity of the data, and decrease the chances of misdiagnosis. In the current work, a technique is proposed to detect undesirable modifications, if present, in a transmitted biomedical ECG signal. The proposed method is based on bio-hashing and reversible watermarking techniques.
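The paper's method combines bio-hashing with reversible watermarking; as a much simpler illustration of the same goal (detecting tampering in a transmitted ECG signal), the sketch below attaches a keyed HMAC of the samples and verifies it on receipt. This is explicitly not the paper's algorithm, and the key and sample values are toy assumptions.

```python
# Simplified tamper-detection sketch for a transmitted ECG signal using
# a keyed HMAC (NOT the paper's bio-hashing/reversible-watermarking
# method). Key and samples are illustrative.
import hashlib
import hmac

def sign(samples, key):
    payload = ",".join(str(s) for s in samples).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify(samples, key, tag):
    return hmac.compare_digest(sign(samples, key), tag)

key = b"shared-secret"           # assumed pre-shared key
ecg = [512, 530, 610, 498, 505]  # toy ECG sample values
tag = sign(ecg, key)

assert verify(ecg, key, tag)      # untampered signal passes
ecg[2] += 1                       # simulate tampering in transit
assert not verify(ecg, key, tag)  # modification is detected
```

Unlike this sketch, a reversible watermark embeds the authentication data inside the signal itself and allows the original samples to be restored exactly after verification.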
Monalisa Dey, Nilanjan Dey, Sainik Kumar Mahata, Sayan Chakraborty, Suvojit Acharjee, and Achintya Das
IEEE
Biometrics integrates various technologies to identify an individual by exploiting physiological and behavioral characteristics that are unique and measurable. This paper proposes a novel technique for developing a robust and secure biometric authentication system. In the current work, an inter-human ECG-hash code is generated by performing an inner product between the electrocardiogram (ECG) feature matrices of two remotely located individuals, each of whom stores the other's ECG features in their database. The accuracy of the system increases because the authentication mechanism requires traits from both individuals between whom the transmission takes place. Moreover, the use of ECG features as a biometric trait enhances the security of the system, as traits such as fingerprints or facial features may be compromised with age or otherwise.
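The inner-product step at the heart of the ECG-hash idea can be sketched directly: two equal-width feature matrices are combined into a shared code that only this pair of individuals can reproduce. The feature values below are toy numbers, and real systems would extract them from actual ECG waveforms.

```python
# Sketch of an inter-human ECG-hash: the inner product A . B^T of two
# individuals' ECG feature matrices. Feature values are illustrative.
def ecg_hash(features_a, features_b):
    """Inner product of two equal-width feature matrices (A . B^T)."""
    return [[sum(x * y for x, y in zip(row_a, row_b))
             for row_b in features_b]
            for row_a in features_a]

alice = [[0.2, 0.5, 0.1], [0.3, 0.4, 0.6]]  # toy ECG feature rows
bob   = [[0.7, 0.1, 0.2], [0.5, 0.9, 0.3]]  # toy ECG feature rows

code = ecg_hash(alice, bob)
# Both parties can recompute the identical code from the stored features.
assert code == ecg_hash(alice, bob)
```

Because the code depends on both parties' features, an attacker holding only one individual's biometric data cannot regenerate it, which is the security property the abstract highlights.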