A predictive analysis of the COVID-19 pandemic for traditional and tree-based regression algorithms
Hari Singh, Piyush Sewal, Dinesh Chander Verma
Impact of Digital Solutions for Improved Healthcare Delivery, 2024
Many works in the literature compare regression algorithms on different datasets. This chapter presents a model that uses a best subset selection approach for the predictors and performs an exhaustive empirical comparison of eight regression algorithms: Linear Regression, Multi-Linear Regression, Polynomial Regression, K-Nearest Neighbors, Lasso, Ridge, Decision Tree, Gradient Boost Tree, and Random Forest Regression, on various predictors from a COVID-19 dataset. The model is evaluated for training accuracy using the R², Root Mean Square Error, and Mean Absolute Error metrics. The test R² and adjusted R² metrics evaluate the model on cross-validation prediction test errors. The predicted values of the dependent variables are checked for similarity and validated using a statistical z-test.
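As a rough illustration of the best subset selection and adjusted R² ideas named in the abstract above, the following minimal sketch exhaustively fits ordinary least squares on every non-empty subset of predictors and keeps the subset with the highest adjusted R². It is not the chapter's implementation: the function names, the use of plain least squares, and the synthetic data are all assumptions made for the example.

```python
from itertools import combinations

import numpy as np


def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


def fit_r2(X, y):
    """Fit ordinary least squares with an intercept and return R^2."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def best_subset(X, y):
    """Try every non-empty subset of columns; keep the best adjusted R^2."""
    n, p = X.shape
    best_cols, best_score = None, -np.inf
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            score = adjusted_r2(fit_r2(X[:, list(cols)], y), n, k)
            if score > best_score:
                best_cols, best_score = cols, score
    return best_cols, best_score


if __name__ == "__main__":
    # Synthetic data (an assumption for the demo): only columns 0 and 2 matter.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
    print(best_subset(X, y))
```

Exhaustive search fits 2^p - 1 models, so this form is only practical for a small number of predictors.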
Algorithmic Proficiency in Spark Configuration Tuning: An Empirical Study using Execution Time Metrics across Varied Workloads
Piyush Sewal, Hari Singh
Procedia Computer Science, 2024
In the realm of big data, where datasets of immense scale pose processing challenges, distributed processing platforms like open-source Apache Spark have emerged to address these issues. Spark’s internal configuration parameters exert varying impacts on execution times based on job characteristics, making manual optimization daunting. The core focus of this study lies in optimizing Spark’s internal configurations, with specific attention directed towards three types of workloads: Iterative-intensive, Memory-intensive, and CPU-intensive. Employing Grid Search, Random Search, and Evolutionary Optimization algorithms yields substantial execution time reductions: 23.24% with Grid Search, 19.71% with Random Search, and 23.06% with Evolutionary Optimization. Notably, Evolutionary Optimization achieves optimal configurations approximately 29% faster than Grid Search. While Random Search and Evolutionary Optimization share similar time requirements, Random Search’s execution time reduction for a given Spark workload is relatively lower. This research sheds light on algorithmic configuration tuning intricacies and its influence on Spark workload execution times, contributing to the exploration of optimizing big data processing platforms.
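To make the contrast between Grid Search and Random Search in the abstract above concrete, here is a minimal sketch under stated assumptions. The property names are real Spark configuration keys, but the candidate values, the `simulated_runtime` cost function (a synthetic stand-in for actually submitting and timing a job), and the baseline configuration are hypothetical and not taken from the paper. Evolutionary Optimization is not shown.

```python
import itertools
import random

# Hypothetical search space: real Spark property names, illustrative values.
SEARCH_SPACE = {
    "spark.executor.memory": ["2g", "4g", "8g"],
    "spark.executor.cores": [1, 2, 4],
    "spark.sql.shuffle.partitions": [50, 100, 200, 400],
}


def simulated_runtime(conf):
    """Synthetic stand-in for running a workload and timing it (seconds)."""
    mem_gb = int(conf["spark.executor.memory"].rstrip("g"))
    cores = conf["spark.executor.cores"]
    parts = conf["spark.sql.shuffle.partitions"]
    return 100.0 + 80.0 / mem_gb + 60.0 / cores + abs(parts - 200) * 0.1


def grid_search(space, cost):
    """Evaluate every combination in the space and return the cheapest."""
    keys = list(space)
    candidates = (
        dict(zip(keys, values)) for values in itertools.product(*space.values())
    )
    return min(candidates, key=cost)


def random_search(space, cost, n_trials, seed=0):
    """Evaluate n_trials randomly sampled configurations; return the cheapest."""
    rng = random.Random(seed)
    trials = [
        {key: rng.choice(values) for key, values in space.items()}
        for _ in range(n_trials)
    ]
    return min(trials, key=cost)


def percent_reduction(baseline, tuned):
    """Execution time reduction relative to the baseline, in percent."""
    return 100.0 * (baseline - tuned) / baseline


if __name__ == "__main__":
    # Hypothetical untuned starting point used only for this demo.
    baseline = {
        "spark.executor.memory": "2g",
        "spark.executor.cores": 1,
        "spark.sql.shuffle.partitions": 200,
    }
    base_time = simulated_runtime(baseline)
    for name, best in [
        ("grid", grid_search(SEARCH_SPACE, simulated_runtime)),
        ("random", random_search(SEARCH_SPACE, simulated_runtime, n_trials=10)),
    ]:
        tuned_time = simulated_runtime(best)
        print(name, best, percent_reduction(base_time, tuned_time))
```

Grid search evaluates all 36 combinations here, while random search evaluates only the requested number of trials, so it can finish sooner but may miss the best configuration.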
A Machine Learning Approach for Predicting Execution Statistics of Spark Application
Piyush Sewal, Hari Singh
PDGC 2022: 2022 7th International Conference on Parallel, Distributed and Grid Computing, 2022
Apache Spark is one of the most popular and widely used open-source distributed processing frameworks; its in-memory computational capabilities let it process very large datasets in a time-efficient manner. However, several factors can affect the performance of an application, including the nature and size of the input dataset, the computational capability of the system, and the nature and design of the algorithm. Hence, several parameters are required to correctly predict the execution statistics of a Spark application, including the execution time of jobs, stages and tasks, memory requirement and usage at the execution level, and I/O cost in the form of shuffle reads and writes. To address these challenges, this paper presents a simulation and machine learning based prediction model that takes only a few initial samples of execution statistics and predicts the performance and execution statistics of the Spark application with high accuracy. The proposed model is evaluated on the Wordcount application in Spark standalone mode, and accuracy metrics show that it achieves high accuracy in predicting execution statistics.
A Critical Analysis of Apache Hadoop and Spark for Big Data Processing
Piyush Sewal, Hari Singh
Proceedings of IEEE International Conference on Signal Processing Computing and Control, 2021
The emergence of big data processing platforms that can work globally in an integrated manner and process huge datasets efficiently has become very significant. This paper presents a critical analysis of two big data processing platforms, Apache Hadoop MapReduce and Apache Spark. Earlier, Hadoop MapReduce was one of the most popular platforms for batch processing of very large datasets, but as the nature of data has shifted from static to dynamic, Apache Spark has proved better for iterative jobs and live data streams. This paper aims to critically compare and analyze Hadoop 1.x, 2.x and 3.x and Spark 1.x, 2.x and 3.x on well-known key parameters such as components, storage system, resource management, fault tolerance, data processing, scalability and performance.
RECENT SCHOLAR PUBLICATIONS
A Predictive Analysis of the COVID-19 Pandemic for Traditional and Tree-Based Regression Algorithms H Singh, P Sewal, DC Verma Impact of Digital Solutions for Improved Healthcare Delivery, 303-340, 2025. Citations: 2
Performance optimization of Spark MLlib workloads using cost efficient RICG model on exponential projective sampling P Sewal, H Singh Cluster Computing 27 (8), 10569-10588, 2024. Citations: 5
Utilizing Twitter data and NLP to analyze and predict public sentiment trends in mental health T Gupta, A Sharma, Aryan, K Rana, P Sewal The International Conference on Recent Trends in Communication & Intelligent …, 2024. Citations: 1
Analyzing distributed Spark MLlib regression algorithms for accuracy, execution efficiency and scalability using best subset selection approach P Sewal, H Singh Multimedia Tools and Applications 83 (15), 44047-44066, 2024. Citations: 11
Performance comparison of apache spark and hadoop for machine learning based iterative GBTR on HIGGS and covid-19 datasets P Sewal, H Singh Scalable Computing: Practice and Experience 25 (3), 1373-1386, 2024. Citations: 12
Improving Execution Workloads in In-Memory Distributed Computing Platform–SPARK P Sewal, H Singh Jaypee University of Information Technology, Solan, HP, 2024
Algorithmic proficiency in spark configuration tuning: An empirical study using execution time metrics across varied workloads P Sewal, H Singh Procedia Computer Science 235, 2307-2317, 2024. Citations: 2
A machine learning approach for predicting execution statistics of spark application P Sewal, H Singh 2022 Seventh International Conference on Parallel, Distributed and Grid …, 2022. Citations: 6
A critical analysis of apache hadoop and spark for big data processing P Sewal, H Singh 2021 6th International Conference on Signal Processing, Computing and …, 2021. Citations: 33
MOST CITED SCHOLAR PUBLICATIONS
A critical analysis of apache hadoop and spark for big data processing P Sewal, H Singh 2021 6th International Conference on Signal Processing, Computing and …, 2021. Citations: 33
Performance comparison of apache spark and hadoop for machine learning based iterative GBTR on HIGGS and covid-19 datasets P Sewal, H Singh Scalable Computing: Practice and Experience 25 (3), 1373-1386, 2024. Citations: 12
Analyzing distributed Spark MLlib regression algorithms for accuracy, execution efficiency and scalability using best subset selection approach P Sewal, H Singh Multimedia Tools and Applications 83 (15), 44047-44066, 2024. Citations: 11
A machine learning approach for predicting execution statistics of spark application P Sewal, H Singh 2022 Seventh International Conference on Parallel, Distributed and Grid …, 2022. Citations: 6
Performance optimization of Spark MLlib workloads using cost efficient RICG model on exponential projective sampling P Sewal, H Singh Cluster Computing 27 (8), 10569-10588, 2024. Citations: 5
A Predictive Analysis of the COVID-19 Pandemic for Traditional and Tree-Based Regression Algorithms H Singh, P Sewal, DC Verma Impact of Digital Solutions for Improved Healthcare Delivery, 303-340, 2025. Citations: 2
Algorithmic proficiency in spark configuration tuning: An empirical study using execution time metrics across varied workloads P Sewal, H Singh Procedia Computer Science 235, 2307-2317, 2024. Citations: 2
Utilizing Twitter data and NLP to analyze and predict public sentiment trends in mental health T Gupta, A Sharma, Aryan, K Rana, P Sewal The International Conference on Recent Trends in Communication & Intelligent …, 2024. Citations: 1
Improving Execution Workloads in In-Memory Distributed Computing Platform–SPARK P Sewal, H Singh Jaypee University of Information Technology, Solan, HP, 2024