Data Examination Part-1: Graphical Examination

You must have watched detective films or read detective novels in which the tiniest clue, one that nobody cares to notice at the beginning, turns out to be the key to the protagonist's puzzle at the end.
I am sharing this from my hands-on experience as a Data Science practitioner: there were many instances in my journey where I got the data and either did not examine it properly at the start or did not examine it at all, simply because I was tempted to apply machine learning algorithms as soon as possible. I paid a heavy price for it at the end. My results were poor and I had no clue what had led to them. When you are short on time, things get worse once you start looking for answers in the wrong places.
It took me a long time to understand what the causes were, and I am excited to share some proven strategies with you. Applied correctly, they should keep you out of the position I have just described.

WHAT IS DATA EXAMINATION?

Whenever you set out to solve a business problem that involves data, your first responsibility is to understand the problem. This is the single most important thing to do, because if you get it wrong, no amount of hard work afterwards will save you.
Once you have a crystal clear picture of the problem, it is time to get your hands on the data and examine everything about it. Does it match the problem? Does it have missing data? If yes, to what extent? Does it have outliers? You should ask all such questions at the very beginning. Doing this gives you two gains-
1. A basic understanding of each variable and of the relationships the variables share among themselves.
2. A sense of the statistical characteristics of your data.
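
To make those first questions concrete, here is a minimal sketch in Python with pandas. The table, its column names and the injected values are all made up for illustration, and the 1.5 * IQR rule used to flag outliers is only one common convention, so treat it as a starting point rather than a recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real business table.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "total_bill": rng.normal(20, 5, size=100),
    "tip": rng.normal(3, 1, size=100),
})
df.loc[[3, 17, 42], "tip"] = np.nan   # inject three missing values
df.loc[7, "total_bill"] = 150.0       # inject one obvious outlier

# Does it have missing data, and to what extent? Share of missing values per column.
missing_share = df.isna().mean()

# Does it have outliers? Flag rows outside 1.5 * IQR of the quartiles.
q1, q3 = df["total_bill"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["total_bill"] < q1 - 1.5 * iqr) | (df["total_bill"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(missing_share)
print(outliers)
```

On your own data you would skip the toy table and run the same two checks on the columns that matter for your problem.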

As data gets bigger and the number of variables in it grows, it becomes harder for any researcher to analyze it and form a proper understanding without missing something. With so many numbers, so many types of variables, and variables whose names no one has ever heard of, getting insight from summary statistics alone is not only difficult but boring too. Graphical representation of data is a boon here: a well-chosen plot can tell you a great deal about the data, the variables, and the relationships among the variables.

In this post, I will tell you how you can examine your data graphically. To make it painless, I will divide it into three parts and address each in turn.

1. The shape of the distribution-

The most foolproof way to understand any metric is to understand the shape of its distribution, and you can establish that with a simple HISTOGRAM. Yes, you read it right: a simple histogram tells you most of what you need to know about the distribution of a continuous metric. For a categorical variable, the equivalent is a bar chart of the counts in each category, so no variable is left out.
In a matter of seconds, you will know whether the distribution looks roughly normal and which way it is skewed.
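
As a rough sketch of what this looks like in practice, the snippet below draws histograms for two made-up variables with matplotlib and prints the sample skewness from pandas in each title. The data is randomly generated purely for illustration; the variable names and the choice of 30 bins are my assumptions, not a rule.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Two hypothetical metrics: one roughly normal, one with a long right tail.
rng = np.random.default_rng(1)
symmetric = pd.Series(rng.normal(50, 10, size=1000), name="symmetric")
right_skewed = pd.Series(rng.exponential(10, size=1000), name="right_skewed")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, s in zip(axes, [symmetric, right_skewed]):
    ax.hist(s, bins=30)
    # Skewness near 0 suggests symmetry; a positive value means a right tail.
    ax.set_title(f"{s.name} (skew = {s.skew():.2f})")
    ax.set_xlabel("value")
    ax.set_ylabel("count")
fig.tight_layout()
# plt.show()  # uncomment in an interactive session
```

The first panel should look like a bell; the second piles up near zero and trails off to the right, which is exactly the kind of thing you want to know before modelling.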

2. Examining the relationship between variables (continuous)-
After understanding each metric you consider important, the next logical step is to understand the nature of the relationships between those metrics.
SCATTER PLOT and SCATTER PLOT MATRIX
If you want to understand the relationship between two variables, use a scatter plot with the line of best fit enabled. If you want to see how each variable relates to every other variable, use a scatter plot matrix.
A scatter plot matrix also gives you a visual feel for the correlation among variables, which is exactly what you want when you are trying to find out how strongly or weakly they are related.
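
Here is a minimal sketch of both plots, assuming made-up data in which tip depends roughly linearly on total_bill and party_size is unrelated to either. The column names and numbers are hypothetical. The best-fit line comes from an ordinary least-squares fit with np.polyfit, and since a plot only shows the pattern, the sketch also prints the Pearson correlation table to put numbers on it.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical data: tip is roughly linear in total_bill, party_size is independent.
rng = np.random.default_rng(2)
n = 200
total_bill = rng.uniform(5, 50, size=n)
df = pd.DataFrame({
    "total_bill": total_bill,
    "tip": 0.15 * total_bill + rng.normal(0, 1, size=n),
    "party_size": rng.integers(1, 7, size=n).astype(float),
})

# Scatter plot of two variables with a least-squares line of best fit.
slope, intercept = np.polyfit(df["total_bill"], df["tip"], deg=1)
fig, ax = plt.subplots()
ax.scatter(df["total_bill"], df["tip"], alpha=0.6)
xs = np.linspace(df["total_bill"].min(), df["total_bill"].max(), 100)
ax.plot(xs, slope * xs + intercept, color="red")
ax.set_xlabel("total_bill")
ax.set_ylabel("tip")

# Scatter plot matrix: every variable against every other, histograms on the diagonal.
axes = scatter_matrix(df, figsize=(8, 8), diagonal="hist")

# Pairwise Pearson correlations to back up what the plots suggest.
corr = df.corr()
print(corr.round(2))
# plt.show()  # uncomment in an interactive session
```

In the matrix, the total_bill versus tip panel should show a clear upward band, while the panels involving party_size should look like shapeless clouds.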

3. Examining Group Differences-
I want you to imagine two variables, the first being DAY and the second TOTAL_BILL.
Since DAY is a categorical feature and TOTAL_BILL a continuous one, using a scatter plot to understand the relationship between the two would not be a good idea (Homework- draw a scatter plot between a categorical and a continuous variable and see for yourself). Using a BOX PLOT instead is a much better idea, because it puts the median, the spread and the outliers of TOTAL_BILL for each DAY side by side.
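
A minimal sketch of such a box plot follows. The DAY and TOTAL_BILL values here are randomly generated with made-up per-day averages, so the numbers mean nothing in themselves; the point is only the shape of the code, with one box per category.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: a categorical DAY and a continuous TOTAL_BILL.
rng = np.random.default_rng(3)
days = ["Thur", "Fri", "Sat", "Sun"]
assumed_means = {"Thur": 17, "Fri": 17, "Sat": 20, "Sun": 21}  # made-up averages
df = pd.DataFrame({"day": rng.choice(days, size=240)})
df["total_bill"] = df["day"].map(assumed_means) + rng.normal(0, 4, size=len(df))

# One array of bills per day, in a fixed order, then one box per array.
groups = [df.loc[df["day"] == d, "total_bill"].to_numpy() for d in days]
fig, ax = plt.subplots()
bp = ax.boxplot(groups)
ax.set_xticks(range(1, len(days) + 1))
ax.set_xticklabels(days)
ax.set_xlabel("day")
ax.set_ylabel("total_bill")

# The line inside each box is the group median; print them for comparison.
medians = df.groupby("day")["total_bill"].median()
print(medians.round(2))
# plt.show()  # uncomment in an interactive session
```

If the boxes for two days barely overlap, that is a strong hint the groups really differ; if they sit on top of each other, the day probably does not tell you much about the bill.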

I hope you find this post helpful. Thank you for reading.

Resources:

Photo by paultom2104