Tree-Based & Ensemble Models in Machine Learning
Tree-based models remain a cornerstone of classical machine learning, especially for structured data. They’re simple, intuitive, versatile, useful for interpretation, and frequently used as reliable baselines—or even final models—in both industry and research settings.
What are Tree-Based Models?
Tree-based models use a flowchart-like structure in which the data is split on feature values to make predictions. Each internal node represents a decision on a feature, and each leaf node represents an output (e.g., a class label or a regression value). Common examples include Decision Trees, Random Forests, and Gradient Boosted Trees; among these, Random Forests and Gradient Boosted Trees are ensembles of Decision Trees.
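For example, fitting a shallow decision tree with scikit-learn and printing it makes this structure easy to see. This is a minimal sketch using the built-in iris dataset; the depth limit is just for readability.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small tabular dataset and fit a shallow decision tree
X, y = load_iris(return_X_y=True, as_frame=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the flowchart-like structure: internal nodes test feature
# thresholds, and each leaf outputs a predicted class
print(export_text(tree, feature_names=list(X.columns)))
```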
Why Tree-Based Models?
- Handle both numerical and categorical features
- Require minimal preprocessing (e.g., no need for feature scaling/normalization)
- Capture non-linear relationships and feature interactions (illustrated in the sketch after this list)
- Offer interpretable results (feature importance, decision paths)
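As a concrete illustration of the minimal-preprocessing and non-linearity points, the sketch below (scikit-learn on synthetic data, purely for illustration) fits a decision tree to a non-linear target with unscaled features and compares it against a linear model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Two features on very different scales, combined through a non-linear interaction
X = np.column_stack([rng.uniform(0, 1, 500), rng.uniform(0, 1000, 500)])
y = np.where(X[:, 1] > 500, X[:, 0], 1 - X[:, 0]) + rng.normal(0, 0.05, 500)

# No scaling or encoding needed: the tree learns the interaction directly
tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
linear = LinearRegression().fit(X, y)

print("tree   R^2:", round(tree.score(X, y), 3))    # close to 1.0
print("linear R^2:", round(linear.score(X, y), 3))  # much lower
```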
Ensemble Models: Boosting vs Bagging
Bagging (e.g. Random Forest):
- Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method [1]
- Averaging many deep decision trees trained on bootstrap samples effectively reduces overfitting
- Random forests decorrelate the trees in the ensemble by forcing each split to consider only a random subset of the predictors (see the sketch after this list)
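A minimal scikit-learn sketch of these two ingredients (bootstrap sampling plus a random feature subset at each split); the dataset and hyperparameters are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each tree sees a bootstrap sample of the rows, and each split considers
# only a random subset of the features (here sqrt(20) ≈ 4 features)
forest = RandomForestClassifier(
    n_estimators=300,
    max_features="sqrt",
    bootstrap=True,
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())
```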
Boosting (e.g. AdaBoost, XGBoost, LightGBM, CatBoost):
- Builds trees sequentially, with each new tree correcting the errors of the previous ones (sketched after this list)
- Can achieve state-of-the-art performance on many benchmarks
- Often more sensitive to hyperparameters
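For comparison, a gradient-boosting sketch using XGBoost's scikit-learn wrapper (assuming the xgboost package is installed); the hyperparameter values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are added one at a time; each new tree fits the errors of the
# current ensemble, and learning_rate shrinks each tree's contribution
booster = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
)
booster.fit(X_train, y_train)
print(booster.score(X_test, y_test))
```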
How to get feature importance?
1. From the model itself
- Use built-in attributes of the models (e.g., Gini importance, gain)
- Available in libraries like scikit-learn, XGBoost, and LightGBM
2. Data-driven methods
- Use permutation importance or model-agnostic tools like SHAP (see the sketch after this list)
- Often give more reliable estimates than impurity-based importances, which can be biased toward high-cardinality features, at the cost of extra computation
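A sketch contrasting the two approaches on a scikit-learn random forest, using the built-in breast cancer dataset purely as an example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# 1. Built-in impurity-based (Gini) importance, computed during training
impurity_importance = dict(zip(X.columns, forest.feature_importances_))

# 2. Permutation importance: shuffle each feature on held-out data and
#    measure how much the score drops
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
permutation_scores = dict(zip(X.columns, perm.importances_mean))

print(sorted(impurity_importance.items(), key=lambda kv: -kv[1])[:5])
print(sorted(permutation_scores.items(), key=lambda kv: -kv[1])[:5])
```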
When to use Tree-Based Models?
- On small to medium-sized tabular datasets
- For feature importance and model interpretability
- As strong baselines before trying more complex deep models
Useful Tools & Libraries:
- scikit-learn: Simple, solid implementations of decision trees and random forests, with built-in impurity-based feature importance computed from node splits
- XGBoost, LightGBM, CatBoost: Optimized gradient boosting libraries with great performance and features
- SHAP: For model explainability and feature attribution (example below)
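A minimal SHAP sketch for a tree ensemble (assuming the shap package is installed; the dataset and model mirror the earlier examples):

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles
explainer = shap.TreeExplainer(forest)
shap_values = explainer.shap_values(X)

# Global view of how much each feature contributes to the predictions
shap.summary_plot(shap_values, X)
```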
Reference
[1] James, Gareth, et al. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.