L1 and L2 Regularization in ML

Why Do We Need Regularization?

Regularization is a technique used in machine learning to prevent overfitting. Overfitting occurs when a model learns the noise in the training data instead of capturing the underlying pattern, leading to poor generalization on unseen data. Regularization techniques add constraints or penalties to the model to improve its ability to generalize.

How Does Overfitting Happen?

  • High model complexity: Too many parameters can lead to memorization rather than learning meaningful patterns.
  • Insufficient training data: Small datasets may cause models to fit noise rather than actual trends.
  • Noisy data: Irrelevant or redundant features can mislead the model.

What is L1 and L2 Regularization?

L1 and L2 regularization are two common approaches to prevent overfitting by modifying the loss function of a model:

L1 Regularization (Lasso Regression)

L1 regularization adds the sum of the absolute values of the coefficients, scaled by a hyperparameter \(\lambda\), to the loss function: \(Loss = \text{MSE} + \lambda \sum |w_i|\)

  • Encourages sparsity by forcing some coefficients to be exactly zero.
  • Useful for feature selection as it removes irrelevant features.
  • Can lead to better interpretability.
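
As a minimal sketch of this behavior (assuming scikit-learn, whose Lasso calls the strength \(\lambda\) alpha; the toy data and alpha value below are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: only the first two of ten features actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha plays the role of lambda in the formula above.
lasso = Lasso(alpha=0.1).fit(X, y)

# Most coefficients are driven exactly to zero -> built-in feature selection.
print(lasso.coef_)
```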

L2 Regularization (Ridge Regression)

L2 regularization adds the sum of the squared values of the coefficients, scaled by a hyperparameter \(\lambda\), to the loss function:

\[Loss = \text{MSE} + \lambda\sum w_i^2\]

  • Penalizes large weights and shrinks all coefficients toward zero, but never sets them exactly to zero.
  • Helps in reducing multicollinearity (high correlation between features).
  • Generally leads to better stability and prevents large parameter values.
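
A brief sketch (assuming scikit-learn's Ridge, where alpha again plays the role of \(\lambda\)) that uses two nearly identical features to show this stabilizing effect on correlated coefficients; the data and alpha are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two highly correlated features: x2 is an almost exact copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.001, size=200)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# The unregularized split between the two collinear columns is essentially
# arbitrary and can produce large offsetting coefficients; ridge shares the
# weight roughly evenly and keeps both coefficients small but non-zero.
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```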

Elastic Net Regularization

Elastic Net combines both L1 and L2 penalties: \(Loss = \text{MSE} + \lambda_1 \sum |w_i| + \lambda_2 \sum w_i^2\)

  • Balances between feature selection (L1) and weight shrinkage (L2).
  • Useful when there are correlated features.
  • Often preferred in practical applications to get the best of both worlds.
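
A minimal sketch with scikit-learn's ElasticNet; note that scikit-learn parametrizes the combined penalty with a single strength alpha and a mixing ratio l1_ratio rather than separate \(\lambda_1\) and \(\lambda_2\) (the values below are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Toy data: a few informative features among many irrelevant ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# scikit-learn's penalty is
#   alpha * l1_ratio * sum|w_i| + 0.5 * alpha * (1 - l1_ratio) * sum w_i^2,
# which corresponds to lambda_1 and lambda_2 in the formula above.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# Irrelevant coefficients are zeroed out (L1) while the rest stay modest (L2).
print(enet.coef_)
```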

Why is L2 regularization useful?

L1 regularization encourages sparsity and reduces model complexity. L2 regularization, by contrast, penalizes large weights and shrinks all coefficients toward zero without setting any of them exactly to zero. So how does L2 prevent overfitting?

  • When weights are large, small changes in input features can result in disproportionately large changes in predictions, making the model highly sensitive to noise (overfitting). By shrinking weights, L2 regularization forces the model to be more stable and less sensitive to small fluctuations in the training data.
  • L2 distributes the impact across multiple features, reducing the risk of the model relying too much on noise in individual features.
  • In cases where features are highly correlated, the model may assign arbitrarily large positive and negative weights to compensate for these correlations, leading to instability. L2 regularization reduces this effect by shrinking the coefficients and stabilizing the model.
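
To make the shrinkage mechanism concrete, here is a small hand-rolled gradient-descent sketch for L2-penalized linear regression (the data, learning rate, and \(\lambda\) values are illustrative assumptions): the penalty simply adds \(2\lambda w\) to the gradient, pulling every weight toward zero at each step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([4.0, -3.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

def fit_ridge_gd(lam, lr=0.01, steps=2000):
    """Minimize MSE + lam * sum(w_i^2) by plain gradient descent."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        grad_mse = (2.0 / n) * X.T @ (X @ w - y)
        grad_l2 = 2.0 * lam * w            # gradient of the L2 penalty
        w -= lr * (grad_mse + grad_l2)     # every step nudges w toward zero
    return w

print("lambda = 0  :", np.round(fit_ridge_gd(0.0), 2))
print("lambda = 1.0:", np.round(fit_ridge_gd(1.0), 2))  # uniformly smaller weights
```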

Summary

L2 does not force weights to zero but shrinks them, leading to a more stable and generalizable model. By preventing excessively large weights, L2 reduces sensitivity to noise and avoids overfitting. It helps with multicollinearity and distributes learning across multiple features.

Conclusion

Both L1 and L2 regularization help mitigate overfitting, but they have different effects on model complexity and feature selection. L1 is ideal when feature selection is needed, while L2 is better for stable and well-distributed parameter values. In practice, Elastic Net, a combination of L1 and L2, is often used to balance both effects.

By understanding these techniques, you can make informed decisions when designing ML models and improving generalization performance!