Scalable Data Cleaning and Preprocessing for Batch ML Models Using PySpark
Authors

M MADHU SRI, P SATHWIKA, D SAI ANUSHA, N PRAVEEN NAIK and Mr. O. Srinivasa, M.Tech

Abstract

This project focuses on building a robust and scalable data cleaning and preprocessing framework using PySpark on Azure Databricks. The solution is designed to handle large-scale datasets efficiently, ensuring high-quality input data for batch machine learning models. By automating data wrangling tasks such as handling missing values, outlier detection, normalization, and feature encoding, the pipeline improves the accuracy and reliability of downstream ML models.

Key Features:


  • High-Performance Data Processing: Utilizes PySpark on Azure Databricks to clean and preprocess massive datasets efficiently.
  • Automated Handling of Missing Data: Implements imputation techniques (mean, median, mode, KNN imputation) to ensure data completeness.
  • Outlier Detection & Treatment: Uses statistical methods (Z-score, IQR) and machine learning-based anomaly detection for data consistency.
  • Feature Engineering & Transformation: Supports one-hot encoding, label encoding, scaling, and PCA for optimized ML input features.
  • Batch Processing for ML Pipelines: Enables scalable preprocessing workflows for large datasets used in batch ML training.
  • MLflow Integration for Data Versioning: Tracks preprocessed datasets and transformations for model reproducibility.

This solution streamlines data preparation for machine learning, ensuring high-quality, structured input for better predictive performance.

 



Published On: 2025-06-07
