In this article, you will gain a deep understanding of the Decision Tree Classifier. It also guides you through the steps for building and evaluating a Decision Tree Classifier, along with techniques to improve its performance. By the end of this article, you will know not only what a Decision Tree Classifier is but also how to implement, optimize and evaluate it.
Decision Tree
It is a supervised learning algorithm used for both classification and regression. It mimics the human decision-making process by breaking the data down into a tree-like structure: the dataset is split into subgroups based on feature values, and at each split the algorithm selects the feature that best separates the data (commonly measured with an impurity score such as Gini; see the sketch after the list below). The structure consists of –
- nodes – to represent attributes,
- edges – to represent decision rules,
- leaves – to represent outcomes.
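To make "best feature" concrete, here is a minimal sketch of how Gini impurity scores a group of labels: lower impurity means a purer group, and the tree prefers splits that reduce impurity the most. The gini_impurity helper below is written only for illustration; it is not part of scikit-learn or of the tutorial's pipeline.

import numpy as np

def gini_impurity(labels):
    # Gini impurity = 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

print(gini_impurity([0, 0, 0, 0]))  # pure group -> 0.0
print(gini_impurity([0, 0, 1, 1]))  # evenly mixed -> 0.5 (worst case for two classes)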
Steps to build and evaluate Decision Tree Classifier
Step – 1. Import Libraries
- First, we import all the necessary libraries: numpy and pandas for data handling, matplotlib and seaborn for visualization, sklearn.model_selection for splitting the dataset, sklearn.tree for the classifier itself and for visualizing the tree, and sklearn.metrics for evaluating the classifier.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
Step – 2. Load Dataset
- Then we load the dataset used to train the decision tree; here it is the Iris dataset bundled with scikit-learn.
from sklearn.datasets import load_iris

dataset = load_iris()
X = dataset.data
y = dataset.target
df = pd.DataFrame(X, columns=dataset.feature_names)
df['target'] = y
Step – 3. Split Data into Training and Testing Sets
- Next, we split the dataset into training and testing sets. The test_size=0.2 indicates that 20% of the data is held out for testing, leaving the remaining 80% for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step – 4. Train a Decision Tree Classifier
- Here, we first create an instance of the Decision Tree Classifier and then train it on the training data. In the code below, we use Gini impurity as the splitting criterion and cap the tree at a maximum depth of 3.
classifier = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
classifier.fit(X_train, y_train)
Step – 5. Visualize Decision Tree
- After training the model, we can visualize the structure of the Decision Tree using plot_tree().
plt.figure(figsize=(12, 6))
plot_tree(classifier, feature_names=dataset.feature_names, class_names=dataset.target_names, filled=True)
plt.show()
Step – 6. Making Predictions
- Then, predictions are made on the test data using the trained model.
y_pred = classifier.predict(X_test)
Step – 7. Model Evaluation
- Finally, we evaluate the model by computing the accuracy score, generating a classification report and analyzing the confusion matrix. This shows how well the model classifies the data.
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=dataset.target_names))

matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(matrix, annot=True, fmt='d', cmap='Blues', xticklabels=dataset.target_names, yticklabels=dataset.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Improving Model Performance
The following methods can enhance Decision Tree performance:
- Pruning – It reduces the size of the tree and hence helps avoid overfitting (see the pruning sketch after this list).
- Hyperparameter Tuning – Hyperparameters such as max_depth, min_samples_split and max_features can be tuned to fine-tune the decision tree (see the tuning sketch after this list).
- Ensemble Methods – Methods like Random Forest and XGBoost combine multiple decision trees and therefore provide better accuracy (see the ensemble sketch after this list).
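As a sketch of pruning, scikit-learn's DecisionTreeClassifier supports cost-complexity pruning through the ccp_alpha parameter. The value 0.01 below is an illustrative assumption, not tuned for this dataset, and the snippet reuses X_train, X_test, y_train and y_test from the steps above.

# Cost-complexity pruning: a larger ccp_alpha removes more branches
pruned = DecisionTreeClassifier(criterion='gini', ccp_alpha=0.01, random_state=42)
pruned.fit(X_train, y_train)
print(f"Pruned tree accuracy: {pruned.score(X_test, y_test):.2f}")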
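For hyperparameter tuning, here is a minimal sketch using GridSearchCV; the grid values below are illustrative assumptions, not recommended settings.

from sklearn.model_selection import GridSearchCV

# Small, illustrative grid over common decision tree hyperparameters
param_grid = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 5, 10],
    'max_features': [None, 'sqrt'],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print(f"Best cross-validation accuracy: {grid.best_score_:.2f}")

By default, GridSearchCV refits the best estimator on the full training set, so grid.predict(X_test) can be used directly afterwards.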
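As an ensemble sketch, RandomForestClassifier from scikit-learn trains many decision trees on bootstrap samples and combines their votes (XGBoost requires the separate xgboost package, so it is not shown here).

from sklearn.ensemble import RandomForestClassifier

# 100 trees, each trained on a bootstrap sample with random feature subsets
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(f"Random Forest accuracy: {forest.score(X_test, y_test):.2f}")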