Decision Tree

Description

Decision Tree builds a classification model in the form of a tree structure in which each leaf node represents a class or a decision.

Why to use

Data Classification

When to use

When the data is categorized into Boolean values.

When not to use

It should not be used when the dependent variable is continuous.

Prerequisites

It should be used only when the dependent variable is discrete.

Input

--

Output

--

Related algorithms

--

Alternative algorithm

--


Statistical Methods used

  • Dimensionality Reduction
  • Maximum Entropy
  • Gini Index

Limitations

  • Can be used only on numerical (independent variable) data.
  • Can be used only for predicting categorical (dependent variable) data.

Decision Tree is located under Machine Learning in Classification, in the task pane on the left. Use the drag-and-drop method to place the algorithm on the canvas. Click the algorithm to view and select different properties for analysis.

Refer to Properties of Decision Tree.












The decision tree algorithm falls under the category of supervised learning. It can be used to solve both regression and classification problems.

Decision tree is used in classification when the predicted outcome is the class to which the data belongs. Decision tree builds a classification model in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, a leaf node represents a classification or decision, and an internal node represents an attribute on which the data is split.

Any Boolean function on discrete attributes can be represented using a decision tree.

Let us take a simple example of a person considering a new job offer. A simple decision tree is shown in the figure below.










As seen in the above figure, a person’s decision to accept or reject a job offer depends on a number of conditions. At each intermediate node, a condition is verified and based on the answer (Yes/No), the decision-making process continues. All the leaf nodes denote the decision based on the attributes preceding that node.
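As a rough illustration only, the same decision process can be written as a set of nested conditions. The attribute names and thresholds below (salary, commute_minutes, offers_growth) are hypothetical stand-ins for the conditions checked at each node of the figure.

# Illustrative only: the job-offer decision tree written as nested conditions.
# The attributes and thresholds are hypothetical stand-ins for the figure.
def accept_offer(salary, commute_minutes, offers_growth):
    if salary < 50000:           # root node: is the salary acceptable?
        return "Decline offer"
    if commute_minutes > 60:     # internal node: is the commute too long?
        return "Decline offer"
    if offers_growth:            # internal node: does the role offer growth?
        return "Accept offer"    # leaf node: decision
    return "Decline offer"       # leaf node: decision

print(accept_offer(salary=65000, commute_minutes=30, offers_growth=True))  # Accept offer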

Properties of Decision Tree

The available properties of Decision Tree are as shown in the figure given below.
























The table given below describes the different fields present in the properties of Decision Tree.

Field

Description

Remark

Task Name

It displays the name of the selected task.

You can click the text field to edit or modify the name of the task as required.

Dependent Variable

It allows you to select the dependent variable.

Any one of the available categorical variables can be selected.

Independent Variable

It allows you to select the experimental or predictor variable(s).

  • Multiple data fields can be selected.
  • Data fields with only numerical values must be selected.

Criterion

It allows you to select the decision-making criterion to be used.

Values are: entropy and gini.


Maximum Features

It allows you to select the number of features to be considered when looking for the best split.

Values are: Auto, sqrt, log2, and None.

Splitter

It allows you to select the strategy used to split the data at each node.

Values are:

Best - It selects the most relevant features of the dataset.

Random - It selects random features of the dataset. 

Maximum Depth

It allows you to set the depth of the decision tree.

A deeper tree can fit the training data more closely, but greater depth also takes more time and computation power and increases the risk of overfitting. So, it is advisable to choose an optimum depth.

Random State

It allows you to set the seed used by the random number generator, so that the results are reproducible.

Dimensionality Reduction

It is the process of reducing the number of input variables.

The values are:

  • None: Dimensionality reduction is not done. All the features (columns) of the dataset are considered for prediction.
  • PCA: Principal Component Analysis – Finds the directions of maximum variance in the features of a dataset and projects the data onto a dataset with fewer features.
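As a point of reference, the properties above resemble the parameters of scikit-learn's DecisionTreeClassifier. The following is only a minimal sketch of how the same settings could be expressed in code, assuming that mapping; the column names and data are made up, and this is not a description of what the tool runs internally.

# Minimal sketch, assuming the properties map onto scikit-learn parameters.
# Column names and data values are hypothetical.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "age":       [22, 35, 47, 52, 28, 40],                   # Independent Variables (numerical)
    "income":    [28000, 54000, 61000, 75000, 33000, 58000],
    "purchased": ["no", "yes", "yes", "yes", "no", "yes"],    # Dependent Variable (categorical)
})
X, y = df[["age", "income"]], df["purchased"]

model = make_pipeline(
    PCA(n_components=2),                  # Dimensionality Reduction = PCA (omit for None)
    DecisionTreeClassifier(
        criterion="gini",                 # Criterion: "gini" or "entropy"
        splitter="best",                  # Splitter: "best" or "random"
        max_depth=3,                      # Maximum Depth
        max_features=None,                # Maximum Features: "sqrt", "log2", or None
        random_state=42,                  # Random State: seed for reproducibility
    ),
)
model.fit(X, y)
print(model.predict(X))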

Concepts in Decision Tree

When we are trying to build the Decision Tree, we start with a root node and recursively add child nodes until we fit the training data perfectly. The problem that we face is to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.

To do this, we use entropy and information gain.

Entropy

It is used to calculate the homogeneity of a sample. It gives us a measure of the impurity of a set of class labels and can be roughly thought of as how mixed the data is. The entropy typically changes when we use a node in a decision tree to partition the training instances into smaller subsets.

The lesser the entropy, the better the output is.

Information Gain

Information gain can be understood as the decrease in "uncertainty" once the value of an attribute is known. Information gain is a measure of this change in entropy. It tells us how important a given attribute of the feature vectors is, and we can use it to decide the ordering of attributes in the nodes of a decision tree.

A feature with the highest Information Gain is selected as the next split attribute.

Assume there are two classes, P and N, and let the training dataset S (with a total of s objects) contain p elements of class P and n elements of class N. Then the entropy is calculated as,

Entropy = -(p/s) log₂(p/s) - (n/s) log₂(n/s)

Information Gain can be thought of as how much entropy is removed once the dataset is split on an attribute. In general,

Information Gain = Entropy(parent) - weighted average Entropy(child subsets)

The higher the value of Information Gain for an attribute, the better that attribute is as the next split.
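As a rough sketch of how these quantities can be computed, the snippet below evaluates the entropy of a small binary-labelled sample and the information gain obtained by splitting it on one hypothetical attribute; the data values are made up for illustration.

# Sketch: entropy and information gain for a tiny, made-up binary dataset.
from math import log2

def entropy(labels):
    # Entropy = -sum over classes of (count/total) * log2(count/total)
    total = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * log2(p)
    return result

def information_gain(labels, attribute_values):
    # Parent entropy minus the weighted entropy of the subsets after the split.
    total = len(labels)
    weighted = 0.0
    for value in set(attribute_values):
        subset = [lbl for lbl, attr in zip(labels, attribute_values) if attr == value]
        weighted += len(subset) / total * entropy(subset)
    return entropy(labels) - weighted

labels  = ["P", "P", "N", "N", "P", "N"]                          # class labels (P/N)
outlook = ["sunny", "sunny", "rain", "rain", "sunny", "sunny"]    # hypothetical attribute

print("Entropy:", entropy(labels))                                # 1.0 (perfectly mixed)
print("Information gain:", information_gain(labels, outlook))    # about 0.46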

Gini

Similar to Entropy, it gives us the measure of impurity in our class. It is used to decide the next attribute to split the data on. The formula to calculate Gini Index is,

Gini = 1 - ∑ (nᵢ/s)²

where nᵢ is the number of elements of class i and s is the total number of elements; for the two classes above, Gini = 1 - (p/s)² - (n/s)².

A feature with a lower Gini Index is selected as the next split attribute.
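A corresponding sketch for the Gini index, using the same made-up labels as before, is shown below.

# Sketch: Gini index of a set of class labels, Gini = 1 - sum((n_i / s) ** 2).
def gini_index(labels):
    total = len(labels)
    return 1.0 - sum((labels.count(cls) / total) ** 2 for cls in set(labels))

print(gini_index(["P", "P", "N", "N", "P", "N"]))  # 0.5, the maximum impurity for two classes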

Interpretation from results of Decision Tree

Sensitivity

In binary classification, sensitivity (also called the true positive rate, hit rate, or recall) is the ability of a test to correctly identify positive results. Thus, it is the ability to test positive when the actual value is also positive.

It is quantitatively measured as the True Positive Rate (TPR), given by

TPR = TP / (TP + FN)

Where,

TP = number of true positives

FN = number of false negatives

Specificity

In binary classification, specificity (also called inverse recall) represents the proportion of negative results that were correctly classified. It is the ratio of the correctly classified negative samples to the total number of negative samples.

It is quantitatively measured as

Specificity = TN / (TN + FP)

Where,

TN = number of True Negatives

FP = number of False Positives

F-Score

In binary classification, F-score (also called the F-measure or F1 score) is a measure of the accuracy of a test. It is the harmonic mean of the precision and the recall of the test.

It is calculated as,

F-score = 2 (precision × recall) / (precision + recall)

Where,

precision = positive predictive value, which is the proportion of predicted positive values that are actually positive

recall = sensitivity of the test, which is the proportion of actual positive values that are correctly identified (the true positive rate).

The best F-score is 1, where the test delivers perfect precision and recall, while the worst F-score is 0, which indicates the lowest precision and recall.

Accuracy

Accuracy (of classification) of a predictive model is the ratio of the total number of correct predictions made by the model to the total predictions made. Thus,

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Where,

TP, TN, FP, and FN indicate True Positives, True Negatives, False Positives, and False Negatives respectively.

Multiplying this ratio by 100 converts it into a percentage. In practice, no model classifier achieves 100% accuracy.

Classification accuracy alone can often be misleading. Hence, it may be desirable to select a model with a slightly lower accuracy if it has higher predictive power on the classification problem.
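As an illustrative sketch, the snippet below shows how sensitivity, specificity, F-score, and accuracy follow from the counts defined above; the counts themselves are made-up values.

# Sketch: classification metrics from made-up confusion-matrix counts.
TP, TN, FP, FN = 40, 45, 5, 10

sensitivity = TP / (TP + FN)                 # recall / true positive rate
specificity = TN / (TN + FP)
precision   = TP / (TP + FP)
f_score     = 2 * precision * sensitivity / (precision + sensitivity)
accuracy    = (TP + TN) / (TP + TN + FP + FN)

print(f"Sensitivity: {sensitivity:.2f}")     # 0.80
print(f"Specificity: {specificity:.2f}")     # 0.90
print(f"F-score:     {f_score:.2f}")         # 0.84
print(f"Accuracy:    {accuracy:.2f}")        # 0.85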

Confusion Matrix

Confusion matrix (also called the error matrix) is a summary of the results of the predictions on a classification problem. It is used to calculate the performance of a classifier on a set of testing data where the true values are already known.

A confusion matrix helps us to visualize the performance of a classification algorithm.
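As an illustrative sketch using scikit-learn's confusion_matrix (an assumption for the example, not necessarily what the tool uses internally), a confusion matrix can be produced from known true values and model predictions as follows; the labels are made up.

# Sketch: building a confusion matrix from known true labels and predictions.
from sklearn.metrics import confusion_matrix

y_true = ["yes", "yes", "no", "no", "yes", "no"]   # hypothetical true values
y_pred = ["yes", "no",  "no", "no", "yes", "yes"]  # hypothetical model predictions

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["yes", "no"]))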
