Implementation of Decision Trees in Python

Introduction

In this blog, we will discuss the implementation of Decision Trees in Python. When it comes to Machine Learning, few models are as popular or widely used as Decision Trees. Decision Trees are a simple yet powerful tool for classification and regression tasks and are well suited to a wide variety of data sets. At its core, a Decision Tree is a flowchart-like structure, with each internal node representing a decision point and each branch representing a possible outcome of that decision. Training a Decision Tree means building this structure from data: starting at the root node, the algorithm repeatedly splits the training samples on the feature that best separates the targets, working its way down until it reaches leaf nodes (i.e., nodes with no children). The path we take from the root to a leaf node is called a decision path.

 

Once a Decision Tree has been trained, it can be used to make predictions on new data points by following the decision path from the root node to a leaf node. The leaf node contains the predicted label (for classification tasks) or value (for regression tasks). In a classification task we are trying to predict a discrete label for a data point, while in a regression task we are trying to predict a continuous value.
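
As a quick sketch of this traversal, a fitted scikit-learn tree exposes a decision_path method that reports exactly which nodes a sample visits on its way to a leaf. The snippet below uses the small iris dataset purely for illustration:

# a fitted tree can report the node sequence a sample follows
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

path = clf.decision_path(X[:1])   # sparse indicator matrix of visited nodes
print(path.indices)               # node ids on the path from root to leaf
print(clf.predict(X[:1]))         # label stored at the reached leaf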


Implementation of Decision Trees

First, we import all the required libraries:

# inbuilt decision tree model
from sklearn.tree import DecisionTreeClassifier

# dataset and train/test split utilities from sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# loading the dataset
cancer = load_breast_cancer()
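
Before fitting anything, it is worth a quick look at what we loaded: the bundled breast cancer dataset has 569 samples, 30 numeric features, and two classes.

# quick look at the data
print(cancer.data.shape)      # (569, 30)
print(cancer.target_names)    # ['malignant' 'benign']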

 

Fitting the model to the dataset

 

X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# creating the model with no depth limit
tree = DecisionTreeClassifier(random_state=0)

# fitting the model with dataset
tree.fit(X_train, y_train)

# printing the accuracy
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

 

Getting the accuracy of the model with a maximum depth of 2 (pre-pruning to reduce overfitting)

 

# fitting with a max depth of 2
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)

# Accuracy of training and testing dataset
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

 

Visualization of Decision Trees

 

# used for visualization
from sklearn.tree import export_graphviz

export_graphviz(tree, out_file="tree.dot", class_names=["malignant", "benign"],
                feature_names=cancer.feature_names, impurity=False, filled=True)

 

import graphviz

# plotting the output
with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
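
Note that this route requires both the graphviz Python package and the Graphviz system binaries. If they are not available, recent scikit-learn versions also ship sklearn.tree.plot_tree, which renders the same tree with matplotlib and skips the intermediate .dot file; a minimal sketch:

# matplotlib-based alternative, no Graphviz needed
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 6))
plot_tree(tree, class_names=["malignant", "benign"],
          feature_names=cancer.feature_names, impurity=False, filled=True)
plt.show()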


Advantages of Decision Trees

 

  1. Simple to understand and interpret, and trees can be visualized.
  2. Requires little data preparation. Other techniques often require data normalization, creation of dummy variables, and removal of blank values (note, however, that scikit-learn's tree module does not support missing values). Because splits only compare features against thresholds, feature scaling does not change the result; see the sketch after this list.
  3. The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
  4. Able to handle both numerical and categorical data in principle, whereas other techniques are usually specialized for datasets with only one type of variable. However, the scikit-learn implementation does not support categorical variables for now.
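
As a small sketch of point 2, the snippet below (reusing the X_train/X_test split from earlier) fits the same depth-2 tree on min-max scaled features. Because a tree only compares each feature against a threshold, monotonic rescaling should leave the result unchanged, so the accuracy should match the unscaled run (up to tie-breaking between equally good splits):

from sklearn.preprocessing import MinMaxScaler

# rescaling does not change which side of a split a sample falls on
scaler = MinMaxScaler().fit(X_train)
tree_scaled = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_scaled.fit(scaler.transform(X_train), y_train)
print("Accuracy on scaled test set: {:.3f}".format(
    tree_scaled.score(scaler.transform(X_test), y_test)))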


Disadvantages of Decision Trees

Decision Trees are a powerful tool for Machine Learning, but they are not without their drawbacks.

 

  1. One main disadvantage of Decision Trees is that they can be very sensitive to small changes in the data: a small change in the training set can alter the split chosen near the root, and every decision below that point changes with it, producing a very different tree.
  2. Another drawback of Decision Trees is that they can be prone to overfitting, especially when the data set is relatively small. Overfitting is when a model captures the noise in the data instead of the signal, which leads to poor generalization performance on new data points. Limiting max_depth, as we did above, is one simple way to counter this.
  3. Predictions of decision trees are neither smooth nor continuous but piecewise-constant approximations, so they are not good at extrapolation; the sketch after this list demonstrates this on a simple regression problem.
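
A short, self-contained sketch of point 3, using an illustrative sine-curve dataset: a DecisionTreeRegressor fitted on inputs in [0, 5] predicts a step function inside that range, and for any input beyond it simply repeats the value of its last leaf:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_reg = np.sort(5 * rng.rand(80, 1), axis=0)   # 80 points in [0, 5]
y_reg = np.sin(X_reg).ravel()

reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_reg, y_reg)

# predictions for 10.0 and 100.0 are identical: the tree cannot extrapolate
print(reg.predict([[4.5], [10.0], [100.0]]))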


Also read: Implementation Of Kmeans Clustering in Python

 
