Deepankar
April 19, 2021 · 7 min read
The decision tree is one of the most basic machine learning algorithms; it has a wide range of use cases and is easy to interpret and implement. We can use a decision tree for both regression and classification tasks. In this article, we will first cover the basics of the decision tree algorithm, and then walk through how a decision tree is generated from a set of training data using the CART algorithm.
The decision tree is a nonparametric supervised learning technique: a tree of many decision rules, all derived from the features of the data. It is one of the easiest machine learning algorithms to understand and explain, and it is the basic building block of the Random Forest, one of the most popular and powerful ML algorithms.
1. Decision tree structure:
- In the image below, I have tried to show what a decision tree looks like. Each internal node represents a segment or region; referring to the tree analogy, segments or regions are the nodes or leaves of the tree.
- Root node: the first node, which holds our entire training dataset.
- Internal node: the point where a subgroup splits into new subgroups or leaf nodes. We can also call it a decision node, because here a node is further subdivided based on the best attribute of its subgroup.
- Leaf node: a terminal node of an internal node; it holds the decision (the predicted class).
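The three node types above can be sketched as a small data structure. This is a minimal illustration with names of my own choosing, not code from the article: internal nodes carry a split rule, leaf nodes carry the decision.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    predicted_class: int                 # the decision kept at a leaf
    feature: Optional[int] = None        # attribute used to split (internal nodes only)
    threshold: Optional[float] = None    # threshold value for the split
    left: Optional["Node"] = None        # subgroup where feature value <= threshold
    right: Optional["Node"] = None       # subgroup where feature value > threshold

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

# A root (internal) node splitting on attribute 0, with two leaf children.
root = Node(predicted_class=0, feature=0, threshold=3.55,
            left=Node(predicted_class=1), right=Node(predicted_class=0))
```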
2. What happens inside a decision tree?
- As mentioned earlier, a decision tree is a tree-like structure with nested nodes, where the split from one node to the next is based on a threshold value of an attribute; we will discuss this in detail shortly.
- The decision tree algorithm divides the training set (the root node) into subgroups, the internal nodes, and every internal node eventually ends in a final subgroup, a leaf node. This process is called recursive partitioning.
Now we're going to go a little deeper to understand the statement "How does node splitting occur based on attribute thresholds?"
The division of a node (the root node) into subnodes is based on purity: the decision tree algorithm splits a node where it finds the greatest homogeneity in the resulting subnodes. If a subnode contains members of only one class, homogeneity is at its maximum. A subnode with a 5/5 class distribution has the lowest homogeneity, while 8/2 or 9/1 distributions are progressively more homogeneous.
To split a node, the decision tree algorithm has to find the best attribute and threshold value. The best pair of attribute and threshold value (f, t) is chosen using one of the criteria below, each of which measures how pure the resulting nodes are:
- CART algorithm: Gini index
- ID3 algorithm: information gain
- C4.5 algorithm: gain ratio
In this article, I will use the CART algorithm to create a decision tree.
This algorithm can be used for both classification and regression. The CART algorithm uses the Gini index criterion to divide a node into subnodes. It starts with the training set as the root node; after successfully splitting the root node in two, it splits the subsets using the same logic, recursively, until no further split improves the purity of a subnode or a stopping criterion (such as the maximum number of leaves in the growing tree) is reached. The grown tree can then be simplified by pruning.
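The recursive flow just described can be sketched in a few lines. This is an illustration, not the article's code: `find_split` stands in for the Gini-based attribute search, and the toy run below passes a fixed rule just to exercise the recursion.

```python
from collections import Counter

def grow(rows, labels, depth, max_depth, find_split):
    """Recursively partition `rows` until nodes are pure or max_depth is hit."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return {"leaf": majority}              # stop: depth limit or pure node
    split = find_split(rows, labels)           # returns (feature, threshold) or None
    if split is None:
        return {"leaf": majority}              # stop: no split improves purity
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    if not left or not right:
        return {"leaf": majority}              # degenerate split: keep as a leaf
    return {
        "feature": f,
        "threshold": t,
        "left": grow([rows[i] for i in left], [labels[i] for i in left],
                     depth + 1, max_depth, find_split),
        "right": grow([rows[i] for i in right], [labels[i] for i in right],
                      depth + 1, max_depth, find_split),
    }

# Toy run: a splitter that always proposes (feature 0, threshold 3.0).
rows, labels = [[1.0], [2.0], [4.0], [5.0]], [0, 0, 1, 1]
tree = grow(rows, labels, 0, 3, lambda r, l: (0, 3.0))
```

Pruning is deliberately omitted: it is a post-processing step that removes leaves which do not improve generalization.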
How to calculate the Gini index?
The Gini index is defined as G = 1 - Σ (pᵢ)², where pᵢ is the probability of class i and C is the total number of classes.
Assume we have only two predictors/attributes: humidity and wind
Classes: rainy and sunny
Gini = 1 - ((number of observations in class 1 / total observations)² + (number of observations in class 2 / total observations)²)
Gini = 1 - ((6/10)² + (4/10)²) => 1 - (0.36 + 0.16) => 1 - 0.52 => 0.48
Thus, the Gini index for the first/initial set is 0.48
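The calculation above can be expressed as a small helper function (a sketch of mine, not the article's snippet); the extra calls show how the 5/5 and pure-node cases discussed earlier score:

```python
def gini(counts):
    """Gini impurity from per-class observation counts: 1 - sum(p_i ** 2)."""
    total = sum(counts)
    return 1.0 - sum((n / total) ** 2 for n in counts)

print(gini([6, 4]))   # initial set: 6 of one class, 4 of the other -> 0.48
print(gini([5, 5]))   # 50/50 distribution: least homogeneous -> 0.5
print(gini([10, 0]))  # pure node -> 0.0
```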
The basic concept of splitting nodes:
Based on the attribute "wind" (f) and the threshold value "3.55" (t), the CART algorithm generated nodes/subsets, producing a pure subset on the right side of the flow above (see Figure 4).
Let us walk through how the best pair (f, t) is chosen to split the root node:
Step 1: Find the Gini index/score of the initial set
I wrote a small code snippet for better understanding:
Now, after loading the data, we find the Gini score for the initial set, or root node, which is stored in best_gini:
From the code above, note the following lines:
- Line #2: number of features = 2
- Line #9: placeholders for the best attribute (best_attribute) and threshold (best_thr)
- Line #11: best_gini score of the initial set = 0.48 (see Figure 4)
Step 2: Find the best split of the initial/training set
Now for the most important part: how do we find the best attribute (f) and threshold (t) for the initial set?
- Lines #1:8 - The algorithm iterates once per attribute, i.e. 2 times, and creates two buckets, left and right. The left bucket starts empty, while the right bucket holds all the row values in sorted order; these sorted values are the candidate thresholds.
- Lines #14:26 - The two initialized buckets (left and right) enter another loop that repeats once per row, i.e. 10 times. On each iteration the algorithm moves one class observation from the right bucket to the left and recalculates the weighted-average Gini, new_gini.
- Lines #35:38 - If new_gini is less than best_gini, we have found a new best attribute and threshold pair (f, t).
Code flow: the algorithm takes each attribute in turn, sorts its list of values (the threshold list), moves each observation one at a time into the left bucket, computes new_gini (the weighted average of the Gini scores of the two buckets), and compares it with the current best_gini value. If the new_gini score is lower, it is kept along with the best attribute (f) and threshold value (t), and the process repeats for all attributes. The best attribute (f) and threshold value (t) are then used to split the node into subnodes/subsets.
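Since the referenced snippet did not survive in this copy of the article, here is a sketch that reimplements the flow just described; the variable names are my own, and the one-attribute data in the usage example is illustrative, not the article's dataset.

```python
import numpy as np

def best_split(X, y, n_classes):
    """Return the (feature, threshold) pair with the lowest weighted Gini.

    Per attribute: sort the values (the threshold list), move observations
    one by one from the right bucket to the left, and keep the best score.
    """
    m = len(y)
    parent_counts = [int(np.sum(y == c)) for c in range(n_classes)]
    best_gini = 1.0 - sum((n / m) ** 2 for n in parent_counts)
    best_feature, best_threshold = None, None   # placeholders, as in Step 1
    for feature in range(X.shape[1]):
        thresholds, classes = zip(*sorted(zip(X[:, feature], y)))
        left = [0] * n_classes                  # left bucket starts empty
        right = parent_counts.copy()            # right bucket holds everything
        for i in range(1, m):                   # i = observations moved left
            c = classes[i - 1]
            left[c] += 1
            right[c] -= 1
            if thresholds[i] == thresholds[i - 1]:
                continue                        # cannot split identical values
            gini_left = 1.0 - sum((left[k] / i) ** 2 for k in range(n_classes))
            gini_right = 1.0 - sum((right[k] / (m - i)) ** 2 for k in range(n_classes))
            new_gini = (i * gini_left + (m - i) * gini_right) / m
            if new_gini < best_gini:            # keep the cleanest split so far
                best_gini = new_gini
                best_feature = feature
                best_threshold = (thresholds[i] + thresholds[i - 1]) / 2
    return best_feature, best_threshold

# Illustrative data: the sorted humidity values from the threshold list below,
# with made-up class labels (0 = rainy, 1 = sunny).
X = np.array([[1.5], [1.5], [1.5], [1.6], [3.4], [3.9], [4.6], [4.7], [5.0], [5.1]])
y = np.array([0, 0, 1, 0, 0, 1, 0, 1, 1, 0])
print(best_split(X, y, n_classes=2))
```

Note the midpoint convention: the threshold is placed halfway between two neighbouring sorted values, which is also what scikit-learn's implementation does.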
The picture below summarizes the flow; the green circles represent the class "sunny" and the red triangles represent the class "rainy".
Example of the best split
To make it clearer how the best split is found, consider the "humidity" attribute at the 7th iteration, where the distribution is as follows:
The equation immediately after the distribution step is the CART cost function, the weighted average of the Gini scores, which gives the new Gini result (new_gini):
The new_gini result is based on the distribution of classes in the left and right buckets:
- Left bucket: 2 sunny and 5 rainy:
G_left = 1 - ((number of sunny observations / total observations)² + (number of rainy observations / total observations)²)
G_left = 1 - ((2/7)² + (5/7)²) => 0.40
- Right bucket: 2 sunny and 1 rainy:
G_right = 1 - ((number of sunny observations / total observations)² + (number of rainy observations / total observations)²)
G_right = 1 - ((2/3)² + (1/3)²) => 0.44
m_left = 7, m_right = 3
new_gini = ((7 × 0.40) + (3 × 0.44)) / 10 => 0.41
This is less than best_gini (0.48), so we keep this attribute and threshold value as the best so far:
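As a quick check of the arithmetic (note that carrying full precision through gives roughly 0.419; the 0.41 above comes from rounding G_left and G_right to two decimals first):

```python
# Left bucket: 2 sunny + 5 rainy; right bucket: 2 sunny + 1 rainy.
g_left = 1 - ((2 / 7) ** 2 + (5 / 7) ** 2)    # ≈ 0.408, rounded to 0.40 above
g_right = 1 - ((2 / 3) ** 2 + (1 / 3) ** 2)   # ≈ 0.444, rounded to 0.44 above
m_left, m_right = 7, 3

# CART cost function: weighted average of the two buckets' Gini scores.
new_gini = (m_left * g_left + m_right * g_right) / (m_left + m_right)
print(round(new_gini, 3))  # ≈ 0.419, still well below best_gini = 0.48
```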
Threshold list = [1.5, 1.5, 1.5, 1.6, 3.4, 3.9, 4.6, 4.7, 5.0, 5.1]
Recalling that we are at the seventh iteration, the value of best_threshold will be (4.7 + 4.6)/2 => 4.65
The best (feature, threshold) pair (f, t) so far is (0, 4.65). If this remained the lowest Gini score, the algorithm would split the initial node on the "humidity" attribute with a threshold value of 4.65, and it would look like the image below:
But this is not the best pair (f, t). The process above continues for all available attributes, searching for a new lowest Gini score; whenever one is found, its threshold value and attribute are kept, and the node is ultimately split on the best attribute and threshold value. For our dataset, the best Gini score is 0.40, for the attribute "wind" (f) with 3.55 as the best cut-off value (t). The following tree, generated by DecisionTreeClassifier from scikit-learn, shows that the node was split on the same attribute and threshold value:
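The article's exact dataset is not reproduced here, so the snippet below uses made-up humidity/wind values purely to show how a fitted DecisionTreeClassifier exposes the chosen attribute (f) and threshold (t) at the root:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical observations: columns are (humidity, wind); 0 = rainy, 1 = sunny.
X = np.array([[1.5, 2.0], [1.6, 3.1], [3.4, 3.3], [4.6, 3.7], [4.7, 4.0],
              [5.0, 4.2], [3.9, 2.1], [1.5, 2.5], [5.1, 3.8], [1.5, 3.9]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 0])

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# The fitted tree stores each node's split rule; index 0 is the root node.
print("root attribute (f):", clf.tree_.feature[0])
print("root threshold (t):", clf.tree_.threshold[0])
```

On this toy data, humidity separates the classes perfectly, so the root splits on feature 0 at the midpoint between the neighbouring values 3.9 and 4.6.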
- We learned what a decision tree is
- Various criterion metrics for creating new nodes/subsets
- The Gini index
- How to calculate the Gini index
- An example of creating nodes based on an attribute (f) and threshold value (t)
- How we search for the best attribute and threshold value pair (f, t)
- Why the pair (f, t) helps in creating pure nodes
- Working decision tree code
Thank you and enjoy reading…