How to Build and Evaluate a Decision Tree Classifier in Python

In this article, you will gain a deep understanding of the Decision Tree Classifier. It walks through the steps for building and evaluating a Decision Tree Classifier, along with techniques to improve its performance. By the end, you will not only know what a Decision Tree Classifier is but also how to implement, optimize, and evaluate one.

Decision Tree

A Decision Tree is a supervised learning algorithm used for both classification and regression. It mimics the human decision-making process, breaking the data down into a tree-like structure: the dataset is split into subgroups based on feature values, and at each split the algorithm selects the feature that best distinguishes the data. As the short sketch after the list below illustrates, the tree consists of –

  • nodes – to represent attributes,
  • edges – to represent decision rules,
  • leaves – to represent outcomes.
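
To make these terms concrete, here is a minimal sketch that trains a shallow tree on the Iris dataset (the same dataset used later in this article) and prints its learned structure as text. Each condition in the output is a decision rule, and each "class:" line is a leaf outcome.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Keep the tree shallow so the printed structure stays readable
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)
print(export_text(tree, feature_names=iris.feature_names))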

Steps to Build and Evaluate a Decision Tree Classifier

Step – 1. Import Libraries

  • First, we import all the necessary libraries: numpy and pandas for data handling, matplotlib and seaborn for visualisation, sklearn.model_selection for splitting the dataset, sklearn.tree for building and visualizing the decision tree, and sklearn.metrics for evaluating the classifier.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Step – 2. Load Dataset

  • Next, we load the Iris dataset that will be used to train the decision tree.
from sklearn.datasets import load_iris
dataset = load_iris()
X = dataset.data
y = dataset.target
df = pd.DataFrame(X, columns=dataset.feature_names)
df['target'] = y
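
Optionally, we can preview the first few rows to confirm the data loaded as expected:
print(df.head())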

Step – 3. Split Data into training and testing sets

  • We then split the dataset into training and testing sets. test_size=0.2 indicates that 20% of the data is held out for testing, leaving 80% for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step – 4. Train a Decision Tree Classifier

  • Here, we first create an instance of the Decision Tree Classifier and then train it on the training data. In the code below, we use Gini impurity as the splitting criterion and limit the tree to max_depth=3.
classifier = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
classifier.fit(X_train, y_train)
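
Once fitted, the classifier exposes feature_importances_, which shows how much each feature contributed to the splits (the values sum to 1). A quick way to inspect it:

# Print each feature's contribution to the tree's splits
for name, importance in zip(dataset.feature_names, classifier.feature_importances_):
    print(f"{name}: {importance:.3f}")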

Step – 5. Visualize Decision Tree

  • After training the model, we can visualise the Decision Tree structure using plot_tree().
plt.figure(figsize=(12,6))
plot_tree(classifier, feature_names=dataset.feature_names, class_names=dataset.target_names, filled=True)
plt.show()

Step – 6. Making Predictions

  • Next, we make predictions on the test data using the trained model.
y_pred = classifier.predict(X_test)
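
The trained model can also score a single new sample, and predict_proba returns the probability for each class. The measurement values below are illustrative, not taken from the dataset:

# One hypothetical flower: sepal length, sepal width, petal length, petal width (cm)
sample = [[5.1, 3.5, 1.4, 0.2]]
print(classifier.predict(sample))        # predicted class label
print(classifier.predict_proba(sample))  # probability for each class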

Step – 7. Model Evaluation

  • Finally, we evaluate the model by computing the accuracy score, generating a classification report, and analyzing the confusion matrix. This tells us how well the model classifies the data.
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy: .2f}")

print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=dataset.target_names))

matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,4))
sns.heatmap(matrix, annot=True, fmt='d', cmap='Blues', xticklabels=dataset.target_names, yticklabels=dataset.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Improving Model Performance

The following methods can enhance Decision Tree performance:

  • Pruning – reduces the depth of the tree and hence helps to avoid overfitting.
  • Hyperparameter Tuning – hyperparameters commonly used to fine-tune a decision tree include max_depth, min_samples_split, and max_features (a short sketch follows this list).
  • Ensemble Methods – methods like Random Forest and XGBoost combine multiple decision trees and therefore provide better accuracy.
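
As a rough sketch of the pruning and tuning ideas above, GridSearchCV can search over max_depth, min_samples_split, and ccp_alpha (scikit-learn's cost-complexity pruning parameter). The grid values here are illustrative, not recommendations:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 5, 10],
    'ccp_alpha': [0.0, 0.01, 0.05],  # larger values prune more aggressively
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Test accuracy:", grid.score(X_test, y_test))

For the ensemble route, RandomForestClassifier offers the same fit/predict interface:

from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("Random Forest test accuracy:", forest.score(X_test, y_test))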
