“It can be better!” was the feedback I got from my parents when I made tea for the first time. It was 1st January, I had made many resolutions, and learning to cook was one of them.
After many days of trying to make a drinkable tea, I realised that two things were ruining it every time. The first was the tea leaves and the other was the sugar.
Somehow I was never able to strike a balance between them. Sometimes the tea would be so strong that no one wanted to drink it, and sometimes there would be so much sugar that my parents made weird faces while drinking it.
Being a perfectionist in my head, I took it very seriously and made some rules for myself. Nothing would be added in one go. I would start with a small quantity, taste it, and if the tea was too mild or not sweet enough I would add more of whatever was needed.
This was my model of accuracy back then for making a perfect tea, and it worked. In no time I was making almost perfect tea. I used the method for my other cooking adventures too, and it worked every time, until my mother found out what I was doing.
My mother was angry that I had been serving them half-eaten, already-tasted food all this time. She stopped tasting anything I made and stopped the rest of the family from eating it too.
In India, the majority of people live by the belief that tasting food before serving it makes it impure. It felt unfair to me, but I took it as a challenge. I treated everything I had cooked so far as my training and decided not to taste while cooking any more.
Preparing the dishes I had made during my training was now easy even without tasting, but I got fucked up again when it came to making new dishes.
You can say that either the model of accuracy I chose was wrong, or it didn’t work well at test time (cooking without pre-tasting), and this happens all the time in machine learning too.
There is no free lunch in machine learning. You will never get a method that works splendidly for all problems, and don’t be surprised if an approach that works on one dataset fails on a different dataset for the same problem.
Selecting the best approach can be one of the most challenging parts of performing any kind of statistical learning. In this post let’s discuss some of the concepts you should keep in mind while choosing a procedure for your data set.
WHY FIT IN WHEN YOU WERE BORN TO STAND OUT
Predicting as close as possible to the actual answer is our topmost priority. In regression problems, we use the Mean Squared Error (MSE) to measure how far the predicted values are from the observed values.
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{f}(x_i) \right)^2$$
where $y_i$ is the observed value, $\hat{f}(x_i)$ is the predicted value for the i-th observation, and $n$ is the number of observations. The MSE is small when the predicted values are close to the observed ones.
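To make the definition concrete, here is a minimal sketch of the MSE computation in Python. The function name `mean_squared_error` and the numbers are made up for illustration; they are not from any particular dataset or library.

```python
import numpy as np

def mean_squared_error(y_observed, y_predicted):
    """Average of the squared differences between observed and predicted values."""
    y_observed = np.asarray(y_observed, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    return float(np.mean((y_observed - y_predicted) ** 2))

# Made-up values, purely for illustration
observed = [3.0, 5.0, 7.0]
close_predictions = [2.5, 5.0, 7.5]   # near the observed values
far_predictions = [1.0, 8.0, 4.0]     # far from the observed values

mse_close = mean_squared_error(observed, close_predictions)
mse_far = mean_squared_error(observed, far_predictions)

print(f"MSE (close predictions): {mse_close:.3f}")
print(f"MSE (far predictions):   {mse_far:.3f}")
```

The predictions that sit closer to the observed values give the smaller MSE, which is all the formula is saying.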
You can calculate the MSE only where you know the observed values, which usually means the training dataset. Since the observed values for the test dataset are unknown, it is more accurate to call this the training MSE. A minimum training MSE is what we aspire to, but on its own it tells you little about how accurately you will predict on the test dataset.
It is often observed that despite a low training MSE, the results on the test data set are very poor. One way to check for this is to divide your training dataset into two parts, 75% for training and 25% for testing, and to treat the latter part as unseen data. If the MSE stays low on that small unseen part, you can go forward assuming the model will work similarly on your original test dataset.
Many researchers, though, claim that reducing the training data brings its own disadvantages. With less data, your model won’t learn as well as it could. Which 25% of the dataset you set aside also plays a role in the measured accuracy: which observations are you not allowing your model to learn from?
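The 75/25 idea, and the worry about which 25% gets held out, can be sketched roughly as follows. Everything here is an assumption for illustration: the data is synthetic, `holdout_split` is a hypothetical helper, and a straight-line fit stands in for whatever model you are actually using.

```python
import numpy as np

def holdout_split(x, y, test_fraction=0.25, seed=0):
    """Shuffle the indices and set aside test_fraction of the rows as unseen data."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    n_test = int(round(len(x) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return x[train_idx], y[train_idx], x[test_idx], y[test_idx]

def mse(y_observed, y_predicted):
    return float(np.mean((y_observed - y_predicted) ** 2))

# Synthetic data: a noisy straight line (assumed, not real data)
data_rng = np.random.default_rng(42)
x = data_rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + data_rng.normal(0, 2.0, size=100)

# Repeat the split with different seeds to see how much the held-out MSE moves
validation_mses = []
for seed in range(5):
    x_tr, y_tr, x_val, y_val = holdout_split(x, y, seed=seed)
    slope, intercept = np.polyfit(x_tr, y_tr, deg=1)  # fit on the 75% only
    validation_mses.append(mse(y_val, slope * x_val + intercept))

for seed, value in enumerate(validation_mses):
    print(f"split {seed}: held-out MSE = {value:.3f}")
```

The held-out MSE changes from split to split even though the data and the model stay the same, which is the concern about which observations you keep aside.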
The other, and most often used, method is to plot the fitted line your model has generated against the training dataset.
Let’s consider this plot. Three fits are drawn through the scatter plot. The orange line represents the linear regression fit. A linear fit has the least flexibility and may underfit. Most of us, in our initial stages, try to fit every regression problem with a straight line. We can’t help it; we have been conditioned to start with a linear fit whenever we encounter a regression problem.
The green fit is the most flexible of the three and tries to pass through every given data point. Trust me, out of the three fits this is the one with the lowest training MSE, but it would fail drastically when predicting for the test data set. Why?
Because it is trying too hard to get all the points into the fit and is picking up patterns that may be caused by random chance alone. If you get the chance to plot the test MSE against the flexibility of the fit, you will notice that the test MSE initially decreases, but after a certain point it starts to increase again. When the training MSE is small but the test MSE is large, we call it overfitting.
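If you want to see this for yourself, here is a small sketch that fits polynomials of increasing degree to the same noisy data and compares training and test MSE. The sine-shaped "true" relationship, the noise level, and the degrees 1, 3 and 12 are all assumptions picked for illustration; they are not the curves from the plot above.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed "real" relationship, chosen only for this illustration
    return np.sin(2 * np.pi * x)

def mse(y_observed, y_predicted):
    return float(np.mean((y_observed - y_predicted) ** 2))

# Small noisy training set and a larger test set drawn the same way
x_train = rng.uniform(0, 1, 30)
y_train = true_f(x_train) + rng.normal(0, 0.3, 30)
x_test = rng.uniform(0, 1, 200)
y_test = true_f(x_test) + rng.normal(0, 0.3, 200)

train_mse, test_mse = {}, {}
for degree in (1, 3, 12):  # rigid, moderate, very flexible
    fit = Polynomial.fit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse[degree] = mse(y_train, fit(x_train))
    test_mse[degree] = mse(y_test, fit(x_test))

for degree in train_mse:
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.3f}, "
          f"test MSE = {test_mse[degree]:.3f}")
```

The training MSE can only go down as the degree goes up, because each higher-degree polynomial contains the lower-degree ones as special cases. The test MSE has no such guarantee, and comparing the two columns is how you spot overfitting.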
The blue fit is neither as rigid as the orange (linear) fit nor as wiggly as the green fit. It has struck the balance and is almost in line with the black fit (optimum).
Whenever you are trying to assess your model’s accuracy, studying the fit can help you identify what is working and what is not.
VARIANCE AND BIAS
The test MSE depends on three factors: variance, bias, and the error that will be there no matter what you do (the irreducible error).
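For readers who like to see it written out, the standard decomposition of the expected test MSE at a new point $x_0$ looks like this. The symbols $x_0$, $y_0$ and $\epsilon$ are notation I am adding here: $y_0$ is the observed value at $x_0$ and $\epsilon$ is the noise in the observations.

$$E\left[ \left( y_0 - \hat{f}(x_0) \right)^2 \right] = \mathrm{Var}\left( \hat{f}(x_0) \right) + \left[ \mathrm{Bias}\left( \hat{f}(x_0) \right) \right]^2 + \mathrm{Var}(\epsilon)$$

The first two terms depend on the method you choose. The last term is the irreducible error, and no model can push the expected test MSE below it.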
Variance refers to how much your fitted model, and hence its output, would change if you used a different dataset for training. Ideally, the output should stay the same or change only nominally from one training set to another. If the method has high variance, a small change in the dataset changes the output drastically.
The green fit discussed above has high variance, and a slight change in the data would throw off the entire model. The linear fit, on the other hand, has the least variance but greater bias. BIAS?
Bias is the error you introduce by approximating a complicated real-world relationship with a much simpler model. For example, if you assume a linear relationship between input and output when in reality the relationship is not linear, the result will have a high bias.
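One rough way to get a feel for bias and variance is to refit the same kind of model on many freshly drawn training sets and look at its predictions at a single point. The sketch below does that under loud assumptions: the true relationship is a sine curve I picked, the data is simulated, `bias_and_variance` is a hypothetical helper, and degrees 1 and 10 stand in for a "rigid" and a "flexible" method.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def true_f(x):
    # Assumed non-linear "real" relationship, for illustration only
    return np.sin(2 * np.pi * x)

def bias_and_variance(degree, x0=0.25, n_datasets=200, n_points=30, noise_sd=0.3):
    """Refit a polynomial on many simulated training sets and study its prediction at x0."""
    predictions = np.empty(n_datasets)
    for i in range(n_datasets):
        x = rng.uniform(0, 1, n_points)
        y = true_f(x) + rng.normal(0, noise_sd, n_points)
        predictions[i] = Polynomial.fit(x, y, degree)(x0)
    bias = predictions.mean() - true_f(x0)   # systematic miss, averaged over datasets
    variance = predictions.var()             # how much the prediction moves between datasets
    return float(bias), float(variance)

bias_linear, var_linear = bias_and_variance(degree=1)
bias_flexible, var_flexible = bias_and_variance(degree=10)

print(f"linear fit:   bias = {bias_linear:+.3f}, variance = {var_linear:.4f}")
print(f"flexible fit: bias = {bias_flexible:+.3f}, variance = {var_flexible:.4f}")
```

In this setup the straight line misses the true value in the same direction on every training set (high bias, low variance), while the flexible polynomial is right on average but its prediction jumps around from one training set to the next (low bias, high variance).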
Variance and bias are two important things to study before finalising your model for implementation. The two usually pull in opposite directions: more flexibility lowers bias but raises variance. Try out various tunings of the model to find the point where both are reasonably low, and hence the test MSE is low.
In both regression and classification problems, choosing the correct level of flexibility is important for the success of any machine learning method.
I will write more on classifiers and techniques to minimise error rates in upcoming posts under the same category.
Please leave a comment and let me know how helpful you found this post.