In this onboarding project, we were tasked with using Python to train a regression model—either Random Forest or Linear Regression—on a given dataset. I decided to investigate a specific question: What is the smallest training set size that still yields near-optimal R² and RMSE values?
The results show that the commonly used 80% training split is not necessarily optimal: for both models, smaller training proportions achieved similar performance, suggesting that more efficient data usage is possible.
The onboarding project for the Unmixing Team introduces new members to machine learning techniques such as Random Forest and Linear Regression. The dataset provided (simpler_data.csv) is well suited to regression analysis.
Although the project instructions recommend training on 80% of the data, this proportion appears to be arbitrary. My goal was to determine whether a smaller training set could achieve comparable accuracy, thereby saving time and computational resources.
The purpose of this study is to identify the minimum training set size required to achieve stable and reliable R² and RMSE scores when modelling the provided dataset.
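The experiment described above can be sketched as a sweep over training-set fractions, recording R² and RMSE at each size. Since the columns of simpler_data.csv are not shown here, a synthetic regression dataset stands in for it; the fraction grid and random seed are illustrative assumptions.

```python
# Sweep training-set fractions and record R^2 / RMSE at each size.
# make_regression stands in for simpler_data.csv (columns unknown here).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

for train_size in [0.1, 0.2, 0.4, 0.6, 0.8]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_size, random_state=0
    )
    model = LinearRegression().fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(f"train_size={train_size:.1f}  R2={r2_score(y_te, pred):.3f}  RMSE={rmse:.2f}")
```

Plotting R² and RMSE against `train_size` then shows where the curves flatten out, i.e. the smallest fraction that still gives near-optimal scores.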
This analysis employed Python along with the following libraries: pandas, scikit-learn (sklearn), and matplotlib.
The two regression techniques used were Random Forest Regression and Linear Regression.