Do you have a data science interview coming up? If so, make sure decision trees are part of your preparation. Here are the questions about decision trees that interviewers ask most often and that people search for most on the net.
The decision tree is one of the most commonly used machine learning algorithms and can be applied to a wide variety of problems. In this post, on a lighter note, we will go through the questions students search for most and interviewers like best.
-Example of a Decision tree
A loan company runs a survey to find out what kind of people take out the most loans. After the survey, it presents its findings as follows.
It divides people into three age groups: under 20, 20-50, and over 50. People under 20 take 2% of all the loans, people over 50 take 5%, and people aged 20-50 take 93%.
People aged 20-50 are further split into two groups, unmarried and married. Unmarried people take 10% of the loans and married people take 90%.
Married people are further split into two groups, those without kids and those with kids. People without kids take 20% of the loans and people with kids take 80%.
From the decision tree, the company can conclude that married people aged 20 to 50 who have kids are the target audience to whom it should pitch its loan schemes.
-HOW DOES A DECISION TREE ALGORITHM WORK?
Step 1 – Choose the best attribute in the dataset as the root node.
Step 2 – Split the data into subsets, so that each subset contains the records that share one value (or range of values) of that attribute.
Step 3 – Repeat Step 2 on each subset until every branch of the tree ends in a leaf node.
Note – The leaf nodes should hold the values of the dependent variable.
In the example above, ‘age’ was found to be the best attribute in the dataset and was therefore placed at the root.
The data is then split into subsets by age, by marital status, and by whether a couple has kids, at successive levels of the tree.
The leaf node of each branch holds the percentage of loans taken by that group.
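The three steps can be sketched in a few lines of Python. The post does not say how the "best" attribute is chosen, so this sketch assumes information gain, the criterion ID3 uses. The toy records and the function names (`entropy`, `split`, `best_attribute`, `build_tree`) are made up for illustration and are not the survey data.

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    """Shannon entropy of the target column over the given rows."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def split(rows, attribute):
    """Group rows into subsets that share the same value of `attribute`."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row)
    return subsets

def best_attribute(rows, attributes, target):
    """Assumed criterion: the attribute with the highest information gain."""
    base = entropy(rows, target)
    def gain(attribute):
        remainder = sum(len(s) / len(rows) * entropy(s, target)
                        for s in split(rows, attribute).values())
        return base - remainder
    return max(attributes, key=gain)

def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Leaf node: all rows agree, or no attributes are left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    root = best_attribute(rows, attributes, target)        # Step 1
    remaining = [a for a in attributes if a != root]
    # Step 2 splits on the chosen attribute; Step 3 recurses into each subset.
    return {root: {value: build_tree(subset, remaining, target)
                   for value, subset in split(rows, root).items()}}

# Hypothetical toy records, loosely modelled on the loan example.
rows = [
    {"age": "<20",   "married": "no",  "takes_loan": "no"},
    {"age": "<20",   "married": "yes", "takes_loan": "no"},
    {"age": "20-50", "married": "yes", "takes_loan": "yes"},
    {"age": "20-50", "married": "yes", "takes_loan": "yes"},
    {"age": "20-50", "married": "no",  "takes_loan": "no"},
    {"age": ">50",   "married": "yes", "takes_loan": "no"},
    {"age": ">50",   "married": "no",  "takes_loan": "no"},
]
tree = build_tree(rows, ["age", "married"], "takes_loan")
print(tree)
```

On this toy data ‘age’ has the higher information gain, so it ends up at the root, and only the 20-50 branch needs a second split on marital status.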
-WHERE TO USE DECISION TREE?
These are the top five areas where decision trees work most effectively:
– Loan approval
– Forecasting the future possibilities in finance
– Purchasing power and willingness of the customer
– Deciding which product to release as per the market demands
– Disease recognition in medical sciences
– CAN DECISION TREE BE USED FOR CLASSIFICATION AND REGRESSION?
Decision trees can be used for both types of problem, though experience has shown that they generally perform better on classification problems.
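To see what changes between the two problem types, here is a minimal sketch of a one-split tree (a "stump") that is fitted the same way in both cases. Only the impurity measure and the leaf value differ: Gini impurity and the majority class for classification, variance and the mean for regression. These are common choices that this sketch assumes, not the only ones, and the ages, labels, and loan amounts are invented for illustration.

```python
from collections import Counter
from statistics import mean, pvariance

def gini(labels):
    """Gini impurity, a common impurity measure for class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def majority(labels):
    """Most frequent class label among the records in a leaf."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(xs, ys, impurity, leaf_value):
    """Find the single threshold on x that minimises the weighted impurity."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, threshold, leaf_value(left), leaf_value(right))
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x < threshold else right_value

ages = [18, 19, 25, 32, 41, 47]

# Classification: the target is a class label, leaves predict the majority class.
took_loan = ["no", "no", "yes", "yes", "yes", "yes"]
classify = fit_stump(ages, took_loan, gini, majority)

# Regression: the target is a number, leaves predict the mean of their records.
loan_amount = [1.0, 1.2, 8.0, 9.0, 10.0, 9.0]
regress = fit_stump(ages, loan_amount, pvariance, mean)

print(classify(30), regress(30))
```

A regression tree can only ever predict one of its leaf means, so its output is a step function. That is one reason trees often look stronger on classification than on smooth numeric targets.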
– HOW DOES A DECISION TREE HANDLE MISSING DATA?
Early decision tree algorithms could not handle missing values, but those times are gone. Newer algorithms can handle them, though how they do so differs from algorithm to algorithm: ID3 does not handle missing values, but CART and C4.5 do. The general strategies are as follows:
1- The record with the missing value goes to the child node that has the largest number of instances.
2- It goes to all the children, but with reduced weights.
3- It goes at random to a single child node.
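One reading of these three strategies can be sketched as a small routing function. This is an illustration under assumptions, not the exact behaviour of CART or C4.5: the function name `route_missing`, the strategy names, and the choice to weight by child size are all made up for this example.

```python
import random

def route_missing(children, strategy, rng=random):
    """Decide where a record goes when the split attribute is missing.

    `children` maps a child name to the number of training instances it holds.
    Returns a dict of child name -> weight carried into that child.
    """
    total = sum(children.values())
    if strategy == "largest":
        # Strategy 1: follow the child with the most instances.
        return {max(children, key=children.get): 1.0}
    if strategy == "weighted":
        # Strategy 2: go to every child, with a weight proportional to its size.
        return {name: count / total for name, count in children.items()}
    if strategy == "random":
        # Strategy 3: pick one child at random (here in proportion to its size).
        names = list(children)
        chosen = rng.choices(names, weights=[children[n] for n in names])[0]
        return {chosen: 1.0}
    raise ValueError(f"unknown strategy: {strategy}")

# Child sizes borrowed from the age split in the loan example.
children = {"<20": 2, "20-50": 93, ">50": 5}
print(route_missing(children, "largest"))
print(route_missing(children, "weighted"))
print(route_missing(children, "random", random.Random(0)))
```

With the weighted strategy the record is effectively split into fractions, and each fraction continues down its own branch.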
– HOW TO INTERPRET DECISION TREE RESULTS?
On any platform you can produce a confusion matrix (also called an error matrix), which gives you the counts of correctly and incorrectly classified records.
For a two-class problem, the resulting 2×2 matrix shows the number of correctly and incorrectly classified records along with the overall error rate, which you can use to decide how much further to prune the tree.
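A confusion matrix is easy to build by hand, which helps when explaining it in an interview. The sketch below uses made-up actual and predicted labels, and `confusion_matrix` here is a small helper written for this example rather than a library function.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Hypothetical results for eight records.
actual    = ["yes", "yes", "no", "no", "yes", "no",  "no", "yes"]
predicted = ["yes", "no",  "no", "no", "yes", "yes", "no", "yes"]

matrix = confusion_matrix(actual, predicted, ["yes", "no"])

# Correct predictions sit on the diagonal, everything else is an error.
correct = sum(matrix[i][i] for i in range(len(matrix)))
error_rate = 1 - correct / len(actual)

print(matrix)
print(error_rate)
```

Here the diagonal holds 6 of the 8 records, so the overall error rate is 0.25.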
– HOW IS DECISION TREE USED FOR TEXT CLASSIFICATION?
The process of classifying text using a decision tree has three phases.
Text Preprocessing – In this phase the text is prepared so that the decision tree can learn to classify it as quickly as possible, which optimises performance for the next phase. It has three steps:
1. Filtration – removing the stop words from the text
2. Stemming – removing the prefixes and suffixes from the words
3. Applying a “Boolean vector” for classification
Implementation – In this phase, finite data is developed from statistical data and algorithms are applied to classify the text.
Visualisation – The results extracted in the implementation phase are reviewed in textual or graphical form.
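The preprocessing steps can be sketched as follows. The stop-word list and the suffix stripper are deliberately tiny stand-ins (a real project would use a proper stop-word list and stemmer), this sketch strips suffixes only, and the sample sentences and function names are made up for illustration.

```python
# Tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "for"}
# Crude stand-in for a real stemmer: suffixes only, checked in this order.
SUFFIXES = ("ing", "ed", "s")

def filter_stop_words(text):
    """Filtration: lowercase the text and drop the stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def stem(word):
    """Stemming: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    return [stem(w) for w in filter_stop_words(text)]

def boolean_vector(tokens, vocabulary):
    """1 if the vocabulary term occurs in the document, otherwise 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

docs = ["The loans are approved", "Applying for a loan", "The match is postponed"]
token_lists = [preprocess(d) for d in docs]
vocabulary = sorted({t for tokens in token_lists for t in tokens})
vectors = [boolean_vector(t, vocabulary) for t in token_lists]

print(vocabulary)
print(vectors)
```

Each document is now a fixed-length row of 0s and 1s, which is the kind of input a decision tree can split on, one vocabulary term per attribute.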
Thanks for reading. The post will be updated frequently as new questions show up in statistics.