Don’t let the skewness screw you.

“Am I screwed?” were the first words Abhijaat said to me, in a muffled voice, when he called me up late Sunday night. Abhijaat is an acquaintance of mine from LinkedIn who started studying data science on his own a few months back and has been applying to companies for Data Analyst roles.

Last week he cleared a telephonic interview with a prestigious company and was given a machine learning problem to solve within a week. Sunday was the third day of that week, and the results his algorithm had started to roll out were nowhere near decent accuracy, sending him into a panic. Here’s why.

Scrambled Eggs

Assuming the distribution to be normal is what we more or less do before an analysis, but sometimes that assumption backfires at the later stages. After hearing what the problem was and what he had done to solve it, I said, “I think I know what needs to be done.”

He hadn’t studied the outliers and their placement in the distribution, something I urge everybody to check for at the very beginning with a careful examination of the data. If you want to know how to examine your data graphically, you can read my post. What are outliers, anyway?

The Working and Playing

Let’s say you are studying the household income of 500 houses in your neighbourhood. 470 of them earn around $1,000 per month, 25 earn between $2,000 and $3,000, and the remaining 5 earn more than one hundred thousand dollars per month.

The mean of the first 495 houses would be a little under $1,100, but if we consider all 500 houses the mean comes out to around $2,100. Those 5 households cause a tremendous shift in the mean and standard deviation, and they are the outliers in this case study. What effect do outliers have on the distribution?
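To see the shift in numbers, here is a quick sketch in Python; the exact outlier income (about $103,500 each) is an assumption chosen to reproduce the means above:

```python
import numpy as np

# 470 households around $1,000, 25 around $2,500, and 5 extreme earners.
# The $103,500 figure is an illustrative assumption that reproduces the
# roughly $2,100 overall mean from the example above.
incomes = np.concatenate([
    np.full(470, 1_000.0),
    np.full(25, 2_500.0),
    np.full(5, 103_500.0),
])

print(np.mean(incomes[:495]))  # ~1076: just under $1,100
print(np.mean(incomes))        # 2100: the 5 outliers double the mean
print(np.std(incomes[:495]), np.std(incomes))  # the spread explodes too
```

Five observations out of five hundred are enough to double the mean and multiply the standard deviation many times over.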

The presence of outliers guarantees that the distribution will not be normal, and an analyst who assumes such a distribution is normal will not get good results. Abhijaat too was a victim of that assumption. I straightaway asked him to plot the distribution and tell me whether there was any skewness. You must be wondering why I am bringing up skewness so late.

“Measure of skewness tells us the direction and the extent of skewness. In symmetrical distribution the mean, median and mode are identical. The more the mean moves away from the mode, the larger the asymmetry or skewness”- Simpson and Kafka

The presence of outliers prevents a distribution from being symmetrical, and a distribution that is not symmetrical is called a skewed distribution. There are many sophisticated ways to identify and measure skewness, but the easiest is the absolute measure of skewness.

Absolute skewness = Mean – Mode
If the mean is greater than the mode, the skewness is positive; if it is smaller, negative; and if the difference is zero, we are not screwed. Sorry, I mean skewed. Following are some more tests of skewness.
  1. The values of mean, median and mode do not coincide.
  2. When the data is plotted on a graph, it does not take the normal bell-shaped form.
  3. Sum of positive deviations from the median is not equal to the sum of negative deviations.
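These checks are easy to run on a small sample. A minimal sketch with Python’s statistics module (the incomes below are made up for illustration):

```python
import statistics

# A small, hypothetical right-skewed sample of monthly incomes in $:
# eleven ordinary households and one extreme earner.
incomes = [1_000] * 8 + [2_500] * 3 + [100_000]

mean = statistics.mean(incomes)      # dragged up by the one big earner
median = statistics.median(incomes)
mode = statistics.mode(incomes)

# Absolute measure of skewness: Mean - Mode
abs_skew = mean - mode
print(mean, median, mode, abs_skew)  # positive difference => positive skew
```

Mean, median and mode do not coincide here, and the mean sits far to the right of the mode: a positively skewed distribution.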
Abhijaat straightaway removed the outliers from the training dataset that were more than three standard deviations away from the mean. (Sometimes we can’t afford to remove the outliers; more on that soon.) On removing the outliers his algorithm started giving decent accuracy on the training dataset, literally a 25% jump. After a few moments of relief, his next question was, “What do we do for the test dataset?”

In any machine learning problem, we are not allowed to remove observations from the test dataset. That means if you find an observation to be an outlier, you can’t just remove it and get away with it. The analysis has to be done with it, or your submission won’t be accepted.
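The three-standard-deviation filter Abhijaat applied to the training set can be sketched like this; the DataFrame and its income column are hypothetical stand-ins for his actual data:

```python
import numpy as np
import pandas as pd

# Hypothetical training data: 100 ordinary incomes plus two extreme
# earners. The column name "income" is an assumption for illustration.
rng = np.random.default_rng(0)
income = np.append(rng.normal(1_000, 50, size=100), [100_000, 120_000])
train = pd.DataFrame({"income": income})

# Drop rows more than three standard deviations from the mean.
mu, sigma = train["income"].mean(), train["income"].std()
train_clean = train[(train["income"] - mu).abs() <= 3 * sigma]

print(len(train), "->", len(train_clean))  # the two extreme rows are gone
```

Note that this only works on the training set; the test set has to keep every row.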

The Success and Playing

I asked Abhijaat to take the logarithm of the independent variable in which the outlier values lie. If we can’t remove the observations with outliers, transforming the values is the best we can do to bring the outlier values closer to the mean.

We took the logarithm because it suited the kind of variable we were dealing with, but that doesn’t mean we are restricted to it. There are many other transformation tricks you could use. Like?

If an independent variable with outliers has all positive values, transform it with a logarithm, a square root or even a cube root. Variables with negative values can be squared or cubed, or you could square first and then take a log or a root as the situation demands.
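A rough sketch of how these transforms compress a right tail, on a made-up positive-valued array:

```python
import numpy as np

# A hypothetical right-skewed, all-positive variable with one outlier.
x = np.array([900.0, 1_000.0, 1_100.0, 2_500.0, 104_000.0])

log_x = np.log(x)    # strongest compression of the tail
sqrt_x = np.sqrt(x)  # milder compression
cbrt_x = np.cbrt(x)  # cube root also handles negative values

# The spread between the largest and smallest value shrinks dramatically.
print(x.max() / x.min())          # ~115.6 before transforming
print(log_x.max() / log_x.min())  # ~1.7 after taking the log
```

The outlier is still in the data, but on the log scale it no longer dominates the distribution.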
The above transformations should be used with a pinch of experimenting spirit: whatever chops off the tail of the distribution, go with it. While I am writing this, Abhijaat is in Mumbai giving his final technical interview. I wish him the job, and I hope you, my wonderful reader, find this post helpful when you need it. See you soon.