An Empirical Study on the Impact of Feature Scaling and Encoding Strategies on Machine Learning Regression Pipelines

Guevara Ananta Toer, Gwanpil Kim

Abstract


Data preprocessing is a critical yet often underestimated component of Machine Learning (ML) regression pipelines. While prior studies have largely focused on algorithm selection and model architecture, the combined impact of feature scaling and categorical encoding strategies within end-to-end regression pipelines remains insufficiently explored. This study presents an empirical evaluation of how different preprocessing configurations influence regression model performance. Three regression algorithms (Linear Regression, Random Forest Regression, and Gradient Boosting Regression) are evaluated in combination with three feature scaling methods (Min–Max, Standard, and Robust scaling) and two categorical encoding techniques (One-Hot and Ordinal encoding). Experiments are conducted on a real-world car sales dataset comprising 50,000 records, using k-fold cross-validation to obtain robust performance estimates. Model performance is assessed primarily using mean R², supported by RMSE and MAE as error-based metrics. The results show that the ensemble-based models, Gradient Boosting and Random Forest, consistently outperform Linear Regression across all preprocessing configurations. Feature scaling has limited influence on ensemble model performance, whereas categorical encoding plays a more significant role, with One-Hot Encoding yielding higher predictive accuracy and lower error dispersion than Ordinal Encoding. Overall, the findings indicate that model choice is the dominant determinant of regression performance, followed by encoding strategy, while scaling has a comparatively minor effect. This study provides empirical guidance for designing robust and effective ML regression pipelines and underscores the importance of evaluating preprocessing techniques in conjunction with model selection.
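To make the compared preprocessing steps and metrics concrete, the following minimal sketch shows what each scaler, encoder, and evaluation metric named in the abstract computes on toy data. It uses only the Python standard library. All function names and toy values are illustrative assumptions, and the definitions follow common conventions (for example, robust scaling as median-centering divided by the interquartile range); the abstract does not state which library, parameters, or fold count the study used, so this is not the authors' implementation.

```python
import math
import statistics


def min_max_scale(xs):
    """Min-Max scaling: map values linearly onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standard_scale(xs):
    """Standard scaling: subtract the mean, divide by the population std."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def robust_scale(xs):
    """Robust scaling: subtract the median, divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]


def ordinal_encode(values):
    """Ordinal encoding: one integer code per category (alphabetical order here)."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def one_hot_encode(values):
    """One-Hot encoding: one 0/1 indicator column per category."""
    cats = sorted(set(values))
    return [[1 if v == cat else 0 for cat in cats] for v in values]


def mae(y_true, y_pred):
    """Mean absolute error."""
    return statistics.fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(statistics.fmean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical toy columns (not taken from the study's dataset).
mileage = [10, 20, 30, 40, 1000]  # numeric feature with one outlier
fuel = ["diesel", "petrol", "hybrid", "petrol"]  # categorical feature

print(min_max_scale(mileage))
print(standard_scale(mileage))
print(robust_scale(mileage))   # [-1.0, -0.5, 0.0, 0.5, 48.5]
print(ordinal_encode(fuel))    # [0, 2, 1, 2]
print(one_hot_encode(fuel))
```

In this toy example the outlier compresses the Min–Max output of the other values toward 0, while robust scaling keeps them evenly spaced. Ordinal encoding assigns integer codes that imply an order among categories, whereas One-Hot encoding does not; this difference is one plausible reason encoding choice can matter for some models, though the abstract itself does not give an explanation.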

Keywords


Machine Learning Regression; Data Preprocessing; Feature Scaling; Categorical Encoding; Pipeline Evaluation

IJIIS: International Journal of Informatics and Information Systems

ISSN: 2579-7069 (Online)
Organized by: Department of Information System, Universitas Amikom Purwokerto, Indonesia; Faculty of Computing and Information Science, Ain Shams University, Cairo, Egypt
Website: www.ijiis.org
Email: husniteja@uinjkt.ac.id (publication issues)
  taqwa@amikompurwokerto.ac.id (managing editor)
  contact@ijiis.org (technical & paper handling issues)

 This work is licensed under a Creative Commons Attribution-ShareAlike 4.0