Tree-Based & Ensemble Models in Machine Learning
Tree-based models remain a cornerstone of classical machine learning, especially for structured data. They’re simple, intuitive, versatile, useful for interpretation, and frequently used as reliable baselines—or even final models—in both industry and research settings.
What are Tree-Based Models?
Tree-based models use a flowchart-like structure in which the data is split on feature values to make predictions. Each internal node represents a decision on a feature, and each leaf node represents an output (e.g., a class label or a regression value). Common examples include Decision Trees, Random Forests, and Gradient Boosted Trees; among these, Random Forests and Gradient Boosted Trees are ensembles of Decision Trees.
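For example, fitting a shallow decision tree with scikit-learn and printing it makes this structure easy to see. This is a minimal sketch using the built-in iris dataset; the depth limit is just for readability.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small tabular dataset and fit a shallow decision tree
X, y = load_iris(return_X_y=True, as_frame=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the flowchart-like structure: internal nodes test feature
# thresholds, and each leaf outputs a predicted class
print(export_text(tree, feature_names=list(X.columns)))
```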
Why Tree-Based Models?
- Handle both numerical and categorical features
- Require minimal preprocessing (e.g., no need for feature scaling/normalization)
- Capture non-linear relationships and feature interactions (illustrated in the sketch after this list)
- Offer interpretable results (feature importance, decision paths)
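As a concrete illustration of the minimal-preprocessing and non-linearity points, the sketch below (scikit-learn on synthetic data, purely for illustration) fits a decision tree to a non-linear target with unscaled features and compares it against a linear model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Two features on very different scales, combined through a non-linear interaction
X = np.column_stack([rng.uniform(0, 1, 500), rng.uniform(0, 1000, 500)])
y = np.where(X[:, 1] > 500, X[:, 0], 1 - X[:, 0]) + rng.normal(0, 0.05, 500)

# No scaling or encoding needed: the tree learns the interaction directly
tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
linear = LinearRegression().fit(X, y)

print("tree   R^2:", round(tree.score(X, y), 3))    # close to 1.0
print("linear R^2:", round(linear.score(X, y), 3))  # much lower
```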
Ensemble Models: Boosting vs Bagging
Bagging (e.g. Random Forest):
- Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method [1]
- Averaging many deep decision trees trained on bootstrap samples effectively reduces overfitting
- Random forests decorrelate the trees in the ensemble by forcing each split to consider only a random subset of the predictors (see the sketch after this list)
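A minimal scikit-learn sketch of these two ingredients (bootstrap sampling plus a random feature subset at each split); the dataset and hyperparameters are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each tree sees a bootstrap sample of the rows, and each split considers
# only a random subset of the features (here sqrt(20) ≈ 4 features)
forest = RandomForestClassifier(
    n_estimators=300,
    max_features="sqrt",
    bootstrap=True,
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())
```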
Boosting (e.g. AdaBoost, XGBoost, LightGBM, CatBoost):
- Builds trees sequentially, with each new tree correcting the errors of the previous ones (sketched after this list)
- Can achieve state-of-the-art performance on many benchmarks
- Often more sensitive to hyperparameters
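For comparison, a gradient-boosting sketch using XGBoost's scikit-learn wrapper (assuming the xgboost package is installed); the hyperparameter values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are added one at a time; each new tree fits the errors of the
# current ensemble, and learning_rate shrinks each tree's contribution
booster = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
)
booster.fit(X_train, y_train)
print(booster.score(X_test, y_test))
```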
How to get feature importance?
1. From the model itself
- Use built-in attributes of the models (e.g., Gini importance, gain)
- Available in libraries like scikit-learn, XGBoost, and LightGBM
2. Data-driven methods
- Use permutation importance or model-agnostic tools like SHAP (see the sketch after this list)
- Often give more reliable estimates than impurity-based importances, which can be biased toward high-cardinality features, at the cost of extra computation
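A sketch contrasting the two approaches on a scikit-learn random forest, using the built-in breast cancer dataset purely as an example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# 1. Built-in impurity-based (Gini) importance, computed during training
impurity_importance = dict(zip(X.columns, forest.feature_importances_))

# 2. Permutation importance: shuffle each feature on held-out data and
#    measure how much the score drops
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
permutation_scores = dict(zip(X.columns, perm.importances_mean))

print(sorted(impurity_importance.items(), key=lambda kv: -kv[1])[:5])
print(sorted(permutation_scores.items(), key=lambda kv: -kv[1])[:5])
```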
When to use Tree-Based Models?
- On small to medium-sized tabular datasets
- For feature importance and model interpretability
- As strong baselines before trying more complex deep models
Useful Tools & Libraries:
- scikit-learn: Simple, solid implementations of decision trees and random forests, with built-in impurity-based feature importance computed from node splits
- XGBoost, LightGBM, CatBoost: Optimized gradient boosting libraries with great performance and features
- SHAP: For model explainability and feature attribution (example below)
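A minimal SHAP sketch for a tree ensemble (assuming the shap package is installed; the dataset and model mirror the earlier examples):

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles
explainer = shap.TreeExplainer(forest)
shap_values = explainer.shap_values(X)

# Global view of how much each feature contributes to the predictions
shap.summary_plot(shap_values, X)
```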
Reference
[1] James, Gareth, et al. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.