Final report:
random_forest_hyperparamter_tuning.pdf
The best R^2 I managed to achieve on validation is <0.63. Random Forest is constantly overfitting (R^2 > 0.90 on the training data), and nothing I have tried so far has been able to deal with it. All of sklearn's hyperparameter tuning methods (GridSearchCV/RandomizedSearchCV) produce overfitting models and max out the parameters.
We are using https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
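For reference, a minimal sketch of a RandomizedSearchCV run over the parameters discussed below. The search ranges here are illustrative, and make_regression is a synthetic stand-in for the real spectra/npv_fractions arrays, which are not reproduced in this report:

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the real spectra / npv_fractions data.
X, y = make_regression(n_samples=300, n_features=50, noise=10.0, random_state=42)
train_X, validate_X, train_y, validate_y = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Illustrative search ranges; the actual sweep covered more parameters.
param_distributions = {
    "max_depth": randint(5, 25),
    "max_features": randint(5, 20),
    "min_samples_split": randint(2, 10),
}
search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=42),
    param_distributions,
    n_iter=10,
    cv=3,
    random_state=42,
)
search.fit(train_X, train_y)
print("Best params:", search.best_params_)
print("Validation R^2:", round(search.score(validate_X, validate_y), 4))
```

In our runs this kind of search tends to push the parameters toward their unconstrained extremes, which is exactly the overfitting behavior described above.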
Best tuning found so far:
# random state 42
# simpler_data
# 900 - 1700nm
import sklearn.ensemble
import sklearn.model_selection

train_X, validate_X, train_y, validate_y = sklearn.model_selection.train_test_split(
    spectra, npv_fractions, test_size=0.2, random_state=42)
rf = sklearn.ensemble.RandomForestRegressor(
    n_estimators=200,
    max_depth=18,
    max_features=13,
    min_samples_split=3,
    min_samples_leaf=1,
    min_impurity_decrease=0.0,
    min_weight_fraction_leaf=0.0,
    random_state=42,
)
rf.fit(train_X, train_y)
print("Training R^2:", round(rf.score(train_X, train_y), 4))        # 0.9392
print("Validation R^2:", round(rf.score(validate_X, validate_y), 4))  # 0.6309
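A sanity check on the train/validation gap that does not depend on a single fixed split (a sketch, not part of the report's pipeline) is out-of-bag scoring, which RandomForestRegressor supports via oob_score=True. Synthetic data stands in for the spectra here:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the spectra; the real arrays would go here.
X, y = make_regression(n_samples=400, n_features=60, noise=15.0, random_state=42)

rf = RandomForestRegressor(n_estimators=200, max_depth=18, oob_score=True,
                           random_state=42)
rf.fit(X, y)
# oob_score_ is R^2 on out-of-bag samples: each tree is scored on the
# training rows it never saw, giving a validation-like estimate for free.
print("Training R^2:", round(rf.score(X, y), 4))
print("OOB R^2:", round(rf.oob_score_, 4))
```

A large gap between the training and OOB R^2 is the same overfitting signal as the 0.9392 vs. 0.6309 gap above, without burning 20% of the data on a holdout split.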
n_estimators
The number of trees (time vs. precision tradeoff). 50-200 all perform about the same. Theoretically, >200 should have no interesting effect on R^2, as we are already at +-0.005 error margins.
max_depth
The depth of the trees (overfitting vs. precision tradeoff). Maximum performance is reached around 15. Greater max_depth only improves the fit to the training data, with no effect on the validation dataset. Note: you might think max_depth multiplies with n_estimators, but this doesn't appear to be the case.
max_features
The number of features (wavelengths) considered when splitting nodes (time vs. precision tradeoff). Very important for us, because we have a lot of wavelengths per sample. Peak performance seems to be >10 (as an absolute count) or <0.5 (as a fraction of all features).
min_samples_split=2
2 works best, and sklearn requires at least 2. (Precision vs. overfitting tradeoff, I think.)
min_samples_leaf=1
(Overfitting vs. precision tradeoff.) 1 gives the best R^2. Effective at decreasing the fit to the training data, but minimal effect on validation.
min_weight_fraction_leaf=0
Increasing it to 0.005 only drops the R^2.
min_impurity_decrease=0
Even 0.0002 already starts decreasing R^2. No significant positive effects.
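The max_depth behavior described above (training fit keeps climbing while validation plateaus) can be reproduced with a small sweep. This is a sketch on synthetic make_regression data standing in for the spectra, so the exact numbers will differ from the report's:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spectra / npv_fractions arrays.
X, y = make_regression(n_samples=300, n_features=50, noise=20.0, random_state=42)
train_X, validate_X, train_y, validate_y = train_test_split(
    X, y, test_size=0.2, random_state=42)

for depth in (3, 8, 15, 25):
    rf = RandomForestRegressor(n_estimators=100, max_depth=depth, random_state=42)
    rf.fit(train_X, train_y)
    print(f"max_depth={depth:2d}  "
          f"train R^2={rf.score(train_X, train_y):.3f}  "
          f"validation R^2={rf.score(validate_X, validate_y):.3f}")
```

On the real data the validation column flattens out around max_depth=15 while the training column keeps rising, which is the pattern that motivated capping the depth.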