What is a decision tree?
As an (up-and-coming) data scientist or manager, you want to know what a decision tree is and which problems you can tackle with one. What solutions can decision trees offer, and what are their limitations?
Decision trees 101: the restaurant
Eating in a restaurant can cause a lot of headaches, especially if the menu is large. First you have to decide if you want an entree before the main course, and if you want a dessert. Then, you have to consider that not every combination of entree and main course will go together well. Also, not every kind of wine goes with every type of dish, and maybe you need a different wine for the entree than for the main course. Decisions, decisions.
A pleasant dinner or a collection of tricky choices?
A nice dinner at a fancy restaurant quickly devolves into a series of tough decisions, where some choices you make influence other choices, like the entree and the main course. And, once made, a choice affects the rest of the (hopefully) pleasant evening. To chart all the choices, you could make a decision tree, where every decision branches down to another decision.
Tough choices for the restaurant owner
The customer isn’t the only one who has to make tough decisions; the restaurant owner has a lot to consider as well. It remains to be seen whether the owner’s provisions match the orders placed that night. To what extent can the owner predict the decisions of every customer? Which decisions do the customers make, and which dishes are most popular on the night in question? The restaurant owner can incorporate the final decisions in a decision tree.
Mutually exclusive options in a decision tree
Both the guest and the restaurant owner have to make a series of mutually exclusive decisions, which is a common situation. The question now is: to what degree can the same approach be applied to selecting and combining data? Here the situation is essentially reversed: the decision tree isn’t used to visualize choices that were made, but to make decisions based on data.
The problems solved by decision trees
A decision tree is a technique that divides a set based on mutually exclusive properties. A blue car, of course, is not red, and a red car isn’t blue, or any other color. Color is an obvious distinguishing property of a car. In the case of cars, the decision tree can go one of two ways:
- Single choice: In this case, the only choices are blue and not-blue.
- Multiple choice: In this case, there are options besides blue, like red, silver, black, white, etc.
On the other hand, there are also properties that are not useful in distinguishing items of a particular class. For example, almost all cars have four wheels, so this property loses its value as a distinguishing property.
Is the count sufficiently accurate?
One wonders to what extent rare car colors can provide reliable data. It’s a nice idea to include “British Racing Green” as an option, but there are only a few cars with that color, so counting these in a decision tree may skew the results. A small sample size reduces the reliability of predictions extrapolated to a larger population. In other words: given the sample size, how accurate and justifiable is the prediction?
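To get a feel for how sample size affects reliability, here is a minimal sketch that uses the standard error of an observed proportion. The counts are hypothetical, the function name is illustrative, and the normal approximation behind the formula is itself rough for very rare colors, so treat the numbers as an indication only.

```python
import math

def proportion_standard_error(count, sample_size):
    """Standard error of an observed proportion (normal approximation)."""
    p = count / sample_size
    return math.sqrt(p * (1 - p) / sample_size)

# Hypothetical counts: the same 2% share of green cars,
# once in a small sample and once in a large one.
small = proportion_standard_error(2, 100)       # 2 green cars out of 100
large = proportion_standard_error(200, 10_000)  # 200 green cars out of 10,000

print(small, large)
```

With the same share of green cars, the uncertainty in the small sample is ten times larger than in the large one, which is why rare colors make for shaky predictions.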
Don’t make predictions based on non-existing data
A hypothetical parking lot contains five cars: one red, one blue, one white, one black, and one yellow. Based on this data, we might conclude that each color has a probability (P_color) of 0.2 of occurring. You might also think there are no green, brown, or purple cars. However, it’s impossible to make a prediction about those colors from this data set, because it’s unclear whether they simply don’t exist or just don’t appear in the sample.
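The parking-lot count can be sketched in a few lines, assuming the five cars are stored as a plain list of color names (the variable and function names are illustrative). Note that the result contains no estimate at all for colors that were never observed.

```python
from collections import Counter

def color_frequencies(colors):
    """Return the relative frequency of each observed color."""
    counts = Counter(colors)
    total = len(colors)
    return {color: count / total for color, count in counts.items()}

parking_lot = ["red", "blue", "white", "black", "yellow"]
p_color = color_frequencies(parking_lot)

print(p_color["red"])        # 0.2
print(p_color.get("green"))  # None: green was never observed, so there is no estimate
```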
Coincidence or cause?
External properties of cars are often used to detect other, less conspicuous properties. One could, for example, conclude that red cars get more speeding tickets than other colors, and that white cars are double parked more often. Especially when making connections like this, it’s important to distinguish between an incidental link (coincidence), and a causal link. The question is, how can this technique of sequential choices be visualized and elaborated on in a decision tree?
The use of a decision tree
By making branching choices, a large data set continually gets divided into smaller and smaller subsets. A decision tree is made by visualizing this pattern. At first glance, this seems like a method of just visually separating data, and not a research method. Yet a decision tree that’s made by analyzing (part of the) data is an excellent way to make predictions. By creating a decision tree for a certain problem, we can determine whether the question is valid given the available information. Is it accurate enough to make predictions?
Do red cars get more speeding tickets?
Whether the question is valid depends on the question being asked, such as whether red cars really get more speeding tickets than cars of other colors. The accuracy of the prediction is partly determined by the sample size and the outcome of the first analysis. To answer these questions, we first need to create a decision tree and investigate.
Two or more subsets
The overarching collection is divided into two or more subsets per decision, based on the number of chosen properties. When dividing into two subsets based on property x1, this can be written mathematically as:
- $P(X = x_1) = p_1$
- $P(X \neq x_1) = 1 - p_1$
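A single two-way split on property x1 could look like the sketch below. The list of cars is made up, and `split_on_property` is an illustrative helper, not part of any library.

```python
def split_on_property(items, has_property):
    """Divide items into (with, without) subsets and return the share p1 as well."""
    with_prop = [item for item in items if has_property(item)]
    without_prop = [item for item in items if not has_property(item)]
    p1 = len(with_prop) / len(items)
    return with_prop, without_prop, p1

# Made-up sample: x1 is the property "the car is blue".
cars = ["blue", "red", "blue", "silver", "black"]
blue, not_blue, p1 = split_on_property(cars, lambda color: color == "blue")

print(blue)      # ['blue', 'blue']
print(not_blue)  # ['red', 'silver', 'black']
print(p1)        # 0.4
```

Every car ends up in exactly one of the two subsets, so the two shares always add up to 1.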
Then, the newly created subsets are sub-divided again. The group possessing property x1 is tested for the presence of property x2, the other group for property x3. Sometimes, only one branch is further developed.
Longer and shorter branches
Going back to the restaurant menu, it doesn’t make sense to incorporate entrees in the decision tree if an entree is not selected. The same goes for no dessert and the complete list of desserts. As a consequence, the decision tree gets longer and shorter branches, and some choices appear in different places.
The only open question remaining is which properties are eligible for incorporation in the decision tree, and in which order? In the end, we can determine the chance of certain properties appearing in certain combinations. It is of utmost importance that the choice of properties is based on objective reasons, determined by data, and not on subjective reasons, determined by the data analyst. Especially in the case of large amounts of data, automation is necessary.
A key aspect of finding the right decision tree for a data set is the entropy (H) of the set. Entropy is a measure of the disorder in a set. If the decision tree structures the entire data set cleanly, so that every end point contains only elements of the same kind, then H = 0, because there is no disorder. For unstructured data sets, the entropy is high.
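As a minimal sketch, assuming H refers to Shannon entropy (the usual choice in decision-tree algorithms such as ID3, though the text does not name a specific formula), the disorder of a set of labels can be computed like this:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Each class contributes p * log2(1 / p), where p = n / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

print(entropy(["blue", "blue", "blue", "blue"]))  # 0.0, a perfectly ordered set
print(entropy(["red", "blue"]))                   # 1.0, two equally likely classes
```

The more evenly the elements are spread over many classes, the higher the value.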
A new set of branches
An algorithm that structures a data set in a decision tree will reduce disorder step by step as it generates new branches. The decrease in entropy per step in the process is called Information Gain. At the end of the process, the algorithm has used every available property and decreased the H value to its lowest possible point for the data set.
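The decrease in entropy for one candidate split can be sketched as follows, again assuming Shannon entropy. The four cars and their ticket labels are invented purely for illustration, and the function names are not taken from any library.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

def information_gain(rows, prop, target):
    """Entropy of the target before the split minus the weighted entropy after it."""
    parent = entropy([row[target] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[prop]].append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted

# Invented data: color separates the ticket labels perfectly, wheels not at all.
cars = [
    {"color": "red", "wheels": 4, "ticket": "yes"},
    {"color": "red", "wheels": 4, "ticket": "yes"},
    {"color": "blue", "wheels": 4, "ticket": "no"},
    {"color": "blue", "wheels": 4, "ticket": "no"},
]

print(information_gain(cars, "color", "ticket"))   # 1.0
print(information_gain(cars, "wheels", "ticket"))  # 0.0
```

A property that every element shares, such as having four wheels, yields no gain, so an algorithm driven by Information Gain would never pick it for a split.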
Incorporating new information
As soon as the data has been incorporated into a decision tree, the tree can be used to process new information immediately. The assumption is that the data set used to build the decision tree is representative of all data.
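Processing a new observation amounts to walking the tree from the top to an end point. The sketch below uses a hand-written, hypothetical tree based on the restaurant example, stored as nested dictionaries. The structure, the labels, and the `classify` function are assumptions made for illustration.

```python
def classify(tree, item):
    """Follow the branches that match the item until an end point (a string) is reached."""
    node = tree
    while isinstance(node, dict):
        value = item.get(node["property"])
        if value not in node["branches"]:
            return None  # this value never occurred in the data the tree was built from
        node = node["branches"][value]
    return node

# Hypothetical tree: the "no entree" branch is shorter than the "entree" branch.
tree = {
    "property": "entree",
    "branches": {
        "no": "short dinner",
        "yes": {
            "property": "dessert",
            "branches": {"yes": "three courses", "no": "two courses"},
        },
    },
}

print(classify(tree, {"entree": "yes", "dessert": "no"}))  # two courses
print(classify(tree, {"entree": "no"}))                    # short dinner
print(classify(tree, {"entree": "maybe"}))                 # None
```

The last call shows what happens when the original data was not representative: the tree has no branch for a value it has never seen.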
The limitations of decision trees
Despite their intuitive nature, decision trees do have some limitations. One of the most important limitations is that every property has to lead to another decision. Not every data set allows for this. There are data sets where combinations of properties weigh a lot more than individual properties. In these cases it’s better and more reliable to create rules.
Cut the decision tree down to usable size
Another danger when using decision trees is making the tree too detailed. This causes every end point to have too few elements, and because of that too little weight. That’s why most algorithms include the option to cut the tree down to size to meet the minimum accuracy requirements. But these are options that have to be applied in moderation, and very carefully. Too often, cutting the tree down leaves only a trivial solution to the problem.
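The trade-off can be illustrated with a simplified sketch. Instead of cutting branches after the tree has been built, it uses a stopping rule that refuses any split leaving an end point with fewer than `min_leaf_size` elements. To stay short, it tries the properties in a fixed order rather than choosing the best one at each step. The data, the parameter name, and the functions are all assumptions for illustration.

```python
from collections import Counter, defaultdict

def majority(labels):
    """Most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, properties, target, min_leaf_size):
    labels = [row[target] for row in rows]
    # Stop when the subset is already pure or no properties are left.
    if len(set(labels)) == 1 or not properties:
        return majority(labels)
    prop, rest = properties[0], properties[1:]
    groups = defaultdict(list)
    for row in rows:
        groups[row[prop]].append(row)
    # Refuse the split if any resulting end point would hold too few elements.
    if any(len(group) < min_leaf_size for group in groups.values()):
        return majority(labels)
    return {
        "property": prop,
        "branches": {
            value: build_tree(group, rest, target, min_leaf_size)
            for value, group in groups.items()
        },
    }

# Invented data set of six cars.
cars = [
    {"color": "red", "doors": 2, "ticket": "yes"},
    {"color": "red", "doors": 4, "ticket": "yes"},
    {"color": "red", "doors": 4, "ticket": "no"},
    {"color": "blue", "doors": 2, "ticket": "no"},
    {"color": "blue", "doors": 4, "ticket": "no"},
    {"color": "blue", "doors": 4, "ticket": "no"},
]

detailed = build_tree(cars, ["color", "doors"], "ticket", min_leaf_size=1)
compact = build_tree(cars, ["color", "doors"], "ticket", min_leaf_size=2)
trivial = build_tree(cars, ["color", "doors"], "ticket", min_leaf_size=4)

print(compact)  # {'property': 'color', 'branches': {'red': 'yes', 'blue': 'no'}}
print(trivial)  # no
```

With a minimum of 1, the tree splits red cars further by number of doors, down to an end point with a single car. With a minimum of 4, no split is allowed at all, and the "tree" collapses into the trivial answer that no car gets a ticket.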