Last week he cleared a telephone interview with a prestigious company and was given a machine learning problem to solve within a week. Sunday was the third day of that week, and the results his algorithm had started to produce were nowhere near a decent accuracy, which had him running amok. Here’s why.
Assuming the distribution is normal is more or less what we all do before starting an analysis, but sometimes making this assumption outright backfires at a later stage. After hearing what the problem was and what he had done to solve it, I said, “I think I know what needs to be done.”
He hadn’t studied the outliers and their placement in the distribution, a step I ask everybody to get out of the way at the very beginning by examining the data. If you want to know how to examine your data graphically, you can read my post. What are outliers, anyway?
The Working and Playing
Let’s say you are studying the household income of 500 houses in your neighbourhood. 470 of them earn around $1,000 per month, 25 earn more than $2,000 but less than $3,000, and the remaining 5 earn more than one hundred thousand dollars per month.
The mean of the first 495 houses would be something short of $1,100, but if we consider all 500 houses the mean would be around $2,100. That is a tremendous shift in the mean, and the standard deviation jumps with it. The 5 households responsible for this shift are the outliers in this case study. What effect do outliers bring to the distribution?
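To see the shift in numbers, here is a minimal sketch of the neighbourhood example. The individual incomes are assumptions, not survey data: each group is given a single representative value, and the five highest earners are set to 103,500 because that is roughly what it takes to move the overall mean to about 2,100.

```python
from statistics import mean, pstdev

# Assumed incomes: one representative value per group, for illustration only.
typical = [1000] * 470   # households earning around 1000 per month
middle = [2500] * 25     # households earning between 2000 and 3000
top = [103_500] * 5      # assumed value for the five highest earners

without_top = typical + middle
everyone = without_top + top

# Mean and (population) standard deviation with and without the top five.
mean_495 = mean(without_top)
mean_500 = mean(everyone)
sd_495 = pstdev(without_top)
sd_500 = pstdev(everyone)

print(f"mean without the top 5: {mean_495:.0f}")
print(f"mean with all 500:      {mean_500:.0f}")
print(f"std dev without top 5:  {sd_495:.0f}")
print(f"std dev with all 500:   {sd_500:.0f}")
```

Five observations out of 500 are enough to roughly double the mean and to inflate the standard deviation many times over.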
The presence of outliers almost always means the distribution is not normal, and an analyst will not get good results by treating such a distribution as a normal one. Abhijaat, too, was a victim of this assumption. I straightaway asked him to draw the distribution and let me know if there was any skewness. You must be thinking: why am I bringing up skewness so late?
“Measure of skewness tells us the direction and the extent of skewness. In symmetrical distribution the mean, median and mode are identical. The more the mean moves away from the mode, the larger the asymmetry or skewness”- Simpson and Kafka
Outliers pull a distribution away from symmetry, and a distribution that is not symmetrical is called a skewed distribution. There are many sophisticated ways to identify and measure skewness, but the easiest is the absolute measure of skewness, the difference between the mean and the mode. A distribution is skewed when:
- The values of mean, median and mode do not coincide.
- When the data is plotted on a graph, it does not give the normal bell-shaped form.
- Sum of positive deviations from the median is not equal to the sum of negative deviations.
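Two of these checks are easy to run by hand. The sketch below uses a small made-up income sample (purely hypothetical values) and takes the absolute measure of skewness to be mean minus mode, which is the common textbook definition.

```python
from statistics import mean, median, mode

# Hypothetical sample: mostly ordinary incomes plus one extreme value.
incomes = [800, 900, 1000, 1000, 1000, 1100, 1200, 2500, 103_500]

m = mean(incomes)
md = median(incomes)
mo = mode(incomes)

# Check 1: mean, median and mode should coincide in a symmetrical distribution.
absolute_skewness = m - mo  # positive value means the tail is on the right

# Check 2: deviations above and below the median should balance out.
pos_dev = sum(x - md for x in incomes if x > md)
neg_dev = sum(md - x for x in incomes if x < md)

print(f"mean={m:.1f}, median={md}, mode={mo}")
print(f"absolute skewness (mean - mode): {absolute_skewness:.1f}")
print(f"positive deviations: {pos_dev}, negative deviations: {neg_dev}")
```

Here the median and mode agree, but the mean is dragged far to the right and the positive deviations dwarf the negative ones, so the sample fails both checks.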
In any machine learning problem, we are not allowed to remove observations from the test dataset. This means that if you find an observation to be an outlier, you can’t just remove it and get away with it. The analysis has to be done with it, or else your submission won’t be accepted.
The Success and Playing
I asked Abhijaat to take the logarithm of the independent variable in which the outlier values lie. If we can’t remove the observations with outliers, transforming the values is the best we can do to bring the outlier values closer to the mean.
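As a rough illustration of what the transformation does, the sketch below applies a natural logarithm to a small made-up income sample. The values and the choice of natural log are assumptions (the actual variable and base are not shown here), and the approach only works as written for strictly positive values.

```python
import math
from statistics import mean

# Hypothetical sample with one extreme value; every value must be positive.
incomes = [800, 900, 1000, 1000, 1000, 1100, 1200, 2500, 103_500]

# Transform every observation; nothing is removed from the dataset.
logged = [math.log(x) for x in incomes]


def max_to_mean_ratio(values):
    """How many times larger the biggest value is than the mean."""
    return max(values) / mean(values)


ratio_before = max_to_mean_ratio(incomes)
ratio_after = max_to_mean_ratio(logged)

print(f"max/mean before log: {ratio_before:.2f}")
print(f"max/mean after log:  {ratio_after:.2f}")
```

Every observation is still there after the transformation, but the extreme value now sits much closer to the rest of the data.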
We took the logarithm because it suited the kind of variable we were dealing with, but that doesn’t mean we are restricted to it. There are many other transformation tricks you could use. Like?