Efficient Ways to Build a Balanced-Fit Model
Getting rid of overfit and underfit models.
Motivation
When we start building machine learning models, we soon run into the overfitting and underfitting characters. They feel like monsters in model building that stop us from getting a good model. To deal with them, I came across four different ways to balance both the dataset and the model.
Here are the four ways to get a balanced-fit model:
- K Fold Cross-Validation
- L1 and L2 Regularization
- Principal Component Analysis
- Bagging & Boosting
Before going into these four ways, let's cover a few essential concepts.
Variance & Bias
Variance is the variability of a model's error across different scenarios (samples) of a dataset during training and testing.
For example, in one scenario the dataset gets a training error of 0 and a testing error of 75; in a second scenario, a training error of 0 and a testing error of 20.
- High variance: the testing error varies a lot between scenarios.
- Low variance: the testing error varies only slightly between scenarios.
Bias measures how accurately a model can capture the pattern in the training dataset.
- High bias: the training error is large.
- Low bias: the training error is small.
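The train/test error pattern above can be sketched numerically. Here is a minimal example, with an illustrative noisy sine dataset (my own choice, not from the article): a degree-1 polynomial underfits (high bias, large training error), a degree-9 polynomial memorizes the 10 training points (near-zero training error, high variance), and a moderate degree sits in between.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small noisy dataset sampled from a sine curve (illustrative assumption)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, x_train.size)
x_test = np.linspace(0.05, 0.95, 50)
y_test = np.sin(2 * np.pi * x_test)

def fit_errors(degree):
    # Fit a polynomial of the given degree; return (train MSE, test MSE)
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

underfit = fit_errors(1)   # high bias: large training error
overfit = fit_errors(9)    # high variance: near-zero training error
balanced = fit_errors(3)   # a reasonable middle ground for a sine curve
```

Comparing the pairs shows the trade-off: the overfit model beats the underfit one on training error, while the balanced model wins on test error.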
A bull's-eye diagram is often used to visualize variance and bias.
When a model overfits the training dataset it ends up with high variance; when it underfits it gets high bias; a balanced fit has both low variance and low bias.
K Fold Cross-Validation
It is one of the best ways to validate a model. It splits the complete dataset into k folds and, on each iteration, holds out one fold for testing while training on the remaining folds; the scores from all iterations are then averaged.
L1 and L2 Regularization
L1 and L2 regularization are also called Lasso and Ridge regression, respectively.
The starting point is the formula for mean squared error (MSE). Adding a penalty term weighted by lambda gives L2 regularization, which helps control the error rate of the model: the penalty grows with the magnitude of theta, so it penalizes higher values of theta, and a larger lambda means stronger shrinkage.
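The figure referenced above is not shown here; the cost functions it describes are usually written as:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
\qquad
J_{\mathrm{L2}} = \mathrm{MSE} + \lambda \sum_{j} \theta_j^{2}
\qquad
J_{\mathrm{L1}} = \mathrm{MSE} + \lambda \sum_{j} \lvert\theta_j\rvert
```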
In L2 we square each theta, while in L1 we take its absolute value. Since the penalty is part of the loss, the optimizer keeps the theta values small, and small theta values keep the overall penalized error small.
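A sketch of the practical difference between the two penalties, assuming scikit-learn is available (the synthetic dataset is an illustrative choice): L2 shrinks coefficients toward zero, while L1 can set them exactly to zero, effectively dropping uninformative features.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 10 features, but only 3 actually influence the target
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: can zero out coefficients entirely

n_zeroed = (lasso.coef_ == 0).sum()  # features dropped by the Lasso
```

This is why Lasso is often described as doing feature selection as a side effect of regularization.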
Principal Component Analysis (PCA)
Principal Component Analysis is used in ML to reduce the dimensionality of a dataset.
- It constructs new components along the directions of maximum variance in the data, so most of the information is retained in fewer features.
- Before applying PCA we need to scale the features, since the components are driven by feature magnitudes.
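The two points above can be sketched with scikit-learn (assumed available; the iris dataset is an illustrative choice): scale first, then project onto a smaller number of components.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Scale first: PCA is sensitive to feature magnitudes
X_scaled = StandardScaler().fit_transform(X)

# Reduce the 4 original features down to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()
```

`explained` reports how much of the original variance survives the reduction, which is the usual way to decide how many components to keep.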
Bagging & Boosting
Bagging is short for bootstrap aggregation, the combination of the bootstrap and aggregation techniques.
- Bootstrap is a technique where the dataset is divided into subsets by resampling with replacement.
- Aggregation combines the results from the bootstrap samples by voting or averaging.
Boosting, in contrast, trains models sequentially, with each new model focusing on the examples the previous ones got wrong.
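Both ensemble styles can be sketched with scikit-learn (assumed available; the synthetic dataset and hyperparameters are illustrative choices, not from the article):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: independent models on bootstrap samples, aggregated by voting
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Boosting: models trained sequentially, each correcting earlier mistakes
boosting = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

bag_acc = bagging.score(X_te, y_te)
boost_acc = boosting.score(X_te, y_te)
```

Bagging mainly reduces variance (it averages away the instability of individual models), while boosting mainly reduces bias, which is why the two are often mentioned together as complementary remedies.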
Conclusion
Through these ways, we can get rid of overfitting issues and arrive at a balanced fit for modeling. We can use any one of the four ways above, and it is also a matter of trial and error to find which of them suits the chosen dataset. I hope you learned something useful for modeling.
Thanks for reading!!