Efficient ways to get a balanced fit model

Akshith Kumar
4 min read · Dec 29, 2021

Getting rid of overfit and underfit models.

Motivation

When we start building machine learning models, we inevitably run into the overfitting and underfitting characters. They feel like monsters of model building that stop me from building a good model. To get rid of them, I came across four different ways to balance both the dataset and the model.


There are four ways to get a balanced fit model.

  • K Fold Cross-Validation
  • L1 and L2 Regularization
  • Principal Component Analysis
  • Bagging & Boosting

Before going into these four ways, we will go over a few essential concepts.

Variance & Bias

Variance is the variability of a model's testing error across different scenarios (samples or splits) of a dataset.

For example, in one scenario the dataset gives a training error of 0 and a testing error of 75; in a second scenario, a training error of 0 and a testing error of 20.

  • High variance: the testing error varies a lot between the two scenarios.
  • Low variance: the testing error varies only slightly between the scenarios.

Bias measures how accurately a model can capture the pattern in the training dataset.

  • High bias: the training error is large.
  • Low bias: the training error is small.

Picture the classic bull's-eye diagram of variance and bias: predictions clustered tightly on the center are low bias and low variance, widely scattered shots mean high variance, and shots clustered away from the center mean high bias.

When a model overfits the training dataset it shows high variance; when it underfits it shows high bias; a balanced fit has both low variance and low bias.
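A quick way to see variance in action is to compare train and test scores. Here is a minimal sketch, assuming scikit-learn and a synthetic dataset (neither is named in this post):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic data, just for illustration.
    X, y = make_classification(n_samples=500, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # An unconstrained decision tree tends to overfit: a near-perfect train
    # score (low bias) with a visibly lower test score (high variance).
    tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
    print("Train accuracy:", tree.score(X_train, y_train))
    print("Test accuracy:", tree.score(X_test, y_test))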

K Fold Cross-Validation

It is one of the best ways to validate a model. The dataset is split into k folds; on each iteration one fold is held out for testing while the model trains on the remaining folds, and the scores are averaged across iterations.

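Here is a minimal sketch of k-fold cross-validation, again assuming scikit-learn and synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)
    model = LogisticRegression(max_iter=1000)

    # 5-fold cross-validation: each fold serves exactly once as the test set.
    scores = cross_val_score(model, X, y, cv=5)
    print("Fold accuracies:", scores)
    print("Mean accuracy:", scores.mean())

Averaging the fold scores gives a more stable estimate of test performance than a single train/test split.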

L1 and L2 Regularization

L1 regularization is used in Lasso regression, and L2 regularization in Ridge regression.

    MSE = (1/n) Σ (y_i − ŷ_i)²

The formula above is the mean squared error. By adding a penalty term λ Σ θ_j² to it we get L2 regularization, which helps control the error of the MSE through λ: the larger λ is, the more strongly big values of θ are penalized, shrinking the weights.

    L2 (Ridge) cost = MSE + λ Σ θ_j²
    L1 (Lasso) cost = MSE + λ Σ |θ_j|

In L2 the penalty squares θ, while in L1 it takes the absolute value of θ. Either way, keeping θ small keeps the overall cost small, so the optimizer prefers smaller weights; L1 can even push some weights exactly to zero.

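A short sketch of Ridge and Lasso in practice, assuming scikit-learn and synthetic data; scikit-learn's alpha parameter plays the role of λ here:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, LinearRegression, Ridge

    X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=42)

    # Larger alpha (lambda) means stronger shrinkage of the weights theta.
    for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
        model.fit(X, y)
        print(type(model).__name__, "sum of |theta|:", round(float(np.abs(model.coef_).sum()), 2))

The regularized models end up with smaller total weights than plain linear regression, which is the penalization described above.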

Principal Component Analysis (PCA)

Principal Component Analysis is used in ML to reduce the dimensionality of a dataset.

  • It figures out the most important features that impact the target variable.
  • Before applying PCA we need to scale the features, as in the sketch below.
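A minimal PCA sketch, assuming scikit-learn and its built-in iris dataset; note that the features are scaled before PCA is applied:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_iris(return_X_y=True)

    # Scale first: PCA is sensitive to feature magnitudes.
    X_scaled = StandardScaler().fit_transform(X)

    # Reduce 4 features to 2 principal components.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X_scaled)
    print("Explained variance ratio:", pca.explained_variance_ratio_)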

Bagging & Boosting

Bagging (bootstrap aggregating) is the combination of the bootstrap and aggregation techniques; boosting, by contrast, trains models sequentially so that each new model focuses on the errors of the previous ones.

  • Bootstrap is a technique where the dataset is divided into subsets by resampling with replacement.
  • Aggregation combines the bootstrap results by summing, averaging, or voting, as in the sketch below.
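A minimal bagging sketch, assuming scikit-learn and synthetic data; BaggingClassifier implements the bootstrap-plus-aggregation idea described above:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    # Fit many trees on bootstrap resamples (sampling with replacement),
    # then aggregate their predictions by voting. BaggingClassifier uses
    # a decision tree as its default base estimator.
    bagged = BaggingClassifier(n_estimators=50, random_state=42)
    print("Mean CV accuracy:", cross_val_score(bagged, X, y, cv=5).mean())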

Conclusion

With these methods we can get rid of overfitting and underfitting and reach a balanced fit for modeling. Any one of the four approaches above can help, and choosing between them is partly trial and error, depending on which suits the dataset at hand. I hope you learned something useful for your own modeling.

Thanks for reading!!
