Final report:
random_forest_hyperparamter_tuning.pdf
The best R^2 I managed to achieve on validation is <0.63. Random Forest is constantly overfitting (R^2 > 0.90 on the training data), and nothing I have tried so far has been able to deal with it. All of sklearn's hyperparameter tuning methods (GridSearchCV/RandomizedSearchCV) produce overfitting models and max out the parameters.
We are using https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
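For reference, a minimal sketch of a RandomizedSearchCV run over the parameters discussed below. The search ranges here are illustrative, and make_regression is a synthetic stand-in for the real spectra/npv_fractions arrays, which are not reproduced in this report:

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the real spectra / npv_fractions data.
X, y = make_regression(n_samples=300, n_features=50, noise=10.0, random_state=42)
train_X, validate_X, train_y, validate_y = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Illustrative search ranges; the actual sweep covered more parameters.
param_distributions = {
    "max_depth": randint(5, 25),
    "max_features": randint(5, 20),
    "min_samples_split": randint(2, 10),
}
search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=42),
    param_distributions,
    n_iter=10,
    cv=3,
    random_state=42,
)
search.fit(train_X, train_y)
print("Best params:", search.best_params_)
print("Validation R^2:", round(search.score(validate_X, validate_y), 4))
```

In our runs this kind of search tends to push the parameters toward their unconstrained extremes, which is exactly the overfitting behavior described above.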
Best tuning found so far:
# random state 42
# simpler_data
# 900 - 1700nm
import sklearn.ensemble
import sklearn.model_selection

train_X, validate_X, train_y, validate_y = sklearn.model_selection.train_test_split(
    spectra, npv_fractions, test_size=0.2, random_state=42)
rf = sklearn.ensemble.RandomForestRegressor(
    n_estimators=200,
    max_depth=18,
    max_features=13,
    min_samples_split=3,
    min_samples_leaf=1,
    min_impurity_decrease=0.0,
    min_weight_fraction_leaf=0.0,
    random_state=42,
)
rf.fit(train_X, train_y)
print("Training R^2:", round(rf.score(train_X, train_y), 4))        # 0.9392
print("Validation R^2:", round(rf.score(validate_X, validate_y), 4))  # 0.6309
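A sanity check on the train/validation gap that does not depend on a single fixed split (a sketch, not part of the report's pipeline) is out-of-bag scoring, which RandomForestRegressor supports via oob_score=True. Synthetic data stands in for the spectra here:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the spectra; the real arrays would go here.
X, y = make_regression(n_samples=400, n_features=60, noise=15.0, random_state=42)

rf = RandomForestRegressor(n_estimators=200, max_depth=18, oob_score=True,
                           random_state=42)
rf.fit(X, y)
# oob_score_ is R^2 on out-of-bag samples: each tree is scored on the
# training rows it never saw, giving a validation-like estimate for free.
print("Training R^2:", round(rf.score(X, y), 4))
print("OOB R^2:", round(rf.oob_score_, 4))
```

A large gap between the training and OOB R^2 is the same overfitting signal as the 0.9392 vs. 0.6309 gap above, without burning 20% of the data on a holdout split.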
n_estimators
The number of trees (time vs. precision tradeoff). 50-200 all perform about the same. Theoretically, >200 should have no interesting effect on R^2, as we are already at +-0.005 error margins.
max_depth
The depth of the trees (overfitting vs. precision tradeoff). Maximum performance is reached around 15. Greater max_depth only improves the fit to the training data, with no effect on the validation dataset. Note: you might think max_depth multiplies with n_estimators, but this doesn't appear to be the case.
max_features
The number of features (wavelengths) considered when splitting nodes (time vs. precision tradeoff). Very important for us, because we have a lot of wavelengths per sample. Peak performance seems to be >10 (as an absolute count) or <0.5 (as a fraction of all features).
min_samples_split=2
2 works best, and sklearn requires at least 2. (Precision vs. overfitting tradeoff, I think.)
min_samples_leaf=1
(Overfitting vs. precision tradeoff.) 1 gives the best R^2. Effective at decreasing the fit to the training data, but minimal effect on validation.
min_weight_fraction_leaf=0
Increasing it to 0.005 only drops the R^2.
min_impurity_decrease=0
Even 0.0002 already starts decreasing R^2. No significant positive effects.
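The max_depth behavior described above (training fit keeps climbing while validation plateaus) can be reproduced with a small sweep. This is a sketch on synthetic make_regression data standing in for the spectra, so the exact numbers will differ from the report's:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spectra / npv_fractions arrays.
X, y = make_regression(n_samples=300, n_features=50, noise=20.0, random_state=42)
train_X, validate_X, train_y, validate_y = train_test_split(
    X, y, test_size=0.2, random_state=42)

for depth in (3, 8, 15, 25):
    rf = RandomForestRegressor(n_estimators=100, max_depth=depth, random_state=42)
    rf.fit(train_X, train_y)
    print(f"max_depth={depth:2d}  "
          f"train R^2={rf.score(train_X, train_y):.3f}  "
          f"validation R^2={rf.score(validate_X, validate_y):.3f}")
```

On the real data the validation column flattens out around max_depth=15 while the training column keeps rising, which is the pattern that motivated capping the depth.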