SCALABLE DATA CLEANING AND PREPROCESSING FOR BATCH ML MODELS USING PYSPARK
Author Name: O. Srinivas, M. Madhusri, P. Sathwika, D. Sai Anusha, N. Praveen Naik

Published On: 2025-06-09

Abstract

This project focuses on building a robust and scalable data cleaning and preprocessing framework using PySpark on Azure Databricks. The solution is designed to handle large-scale datasets efficiently, ensuring high-quality input data for batch machine learning models. By automating data wrangling tasks such as handling missing values, outlier detection, normalization, and feature encoding, the pipeline improves the accuracy and reliability of downstream ML models.

Key Features:

This solution streamlines data preparation for machine learning, ensuring high-quality, structured input for better predictive performance.