Scalable Data Cleaning and Preprocessing for Batch ML Models Using PySpark
Authors

M MADHU SRI, P SATHWIKA, D SAI ANUSHA, N PRAVEEN NAIK and Mr. O. Srinivasa, M.Tech

Abstract

This project focuses on building a robust and scalable data cleaning and preprocessing framework using PySpark on Azure Databricks. The solution is designed to handle large-scale datasets efficiently, ensuring high-quality input data for batch machine learning models. By automating data wrangling tasks such as handling missing values, outlier detection, normalization, and feature encoding, the pipeline improves the accuracy and reliability of downstream ML models.

Key Features:


  • High-Performance Data Processing: Utilizes PySpark on Azure Databricks to clean and preprocess massive datasets efficiently.
  • Automated Handling of Missing Data: Implements imputation techniques (mean, median, mode, KNN imputation) to ensure data completeness.
  • Outlier Detection & Treatment: Uses statistical methods (Z-score, IQR) and machine learning-based anomaly detection for data consistency.
  • Feature Engineering & Transformation: Supports one-hot encoding, label encoding, scaling, and PCA for optimized ML input features.
  • Batch Processing for ML Pipelines: Enables scalable preprocessing workflows for large datasets used in batch ML training.
  • MLflow Integration for Data Versioning: Tracks preprocessed datasets and transformations for model reproducibility.

This solution streamlines data preparation for machine learning, ensuring high-quality, structured input for better predictive performance.

 



Published On: 2025-06-07
