Decision Trees in Python using Pandas and NetworkX


Decision trees are commonly used for classification problems in machine learning. This program takes a CSV file and the name of the output variable, and returns a PNG image of the resulting decision tree.

Installation:

$> pip install -r requirements.txt

How to run:

$> python3 main.py -f <csvfile> -o <output_col> [--drop <drop_cols>] [--thresh <threshold>]

where csvfile = path to the CSV file containing the data (including a header row),
output_col = name of the column that holds the target decision variable,
drop_cols = list of columns to drop before generating the tree,
threshold = tolerance on the purity of the terminal nodes of the tree
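
The options above could be parsed along the following lines. This is only a sketch: the flag names (-f, -o, --drop, --thresh) come from the usage line, while the function name build_parser, the dest names, the types, and the defaults are assumptions and may differ from what main.py actually does.

```python
import argparse


def build_parser():
    # Flag names follow the usage line; types and defaults are assumptions.
    parser = argparse.ArgumentParser(
        description="Build a decision tree from a CSV file")
    parser.add_argument("-f", dest="csvfile", required=True,
                        help="path to the CSV file (with a header row)")
    parser.add_argument("-o", dest="output_col", required=True,
                        help="name of the target column")
    parser.add_argument("--drop", dest="drop_cols", nargs="*", default=[],
                        help="columns to drop before building the tree")
    parser.add_argument("--thresh", dest="threshold", type=float, default=0.0,
                        help="purity tolerance for terminal nodes")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
# "data.csv", "label", "id" and "name" are made-up example values.
args = build_parser().parse_args(
    ["-f", "data.csv", "-o", "label", "--drop", "id", "name"])
print(args.csvfile, args.output_col, args.drop_cols, args.threshold)
```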

Program Details:

The program uses Entropy and Information Gain (IG) to decide upon the splitting attribute of the dataset at various levels of the tree. The attribute with the highest IG is selected as the node for splitting at that level.

The buildTree() function recursively generates the decision tree as a nested dictionary. It computes the IG for each attribute at that level and decides the splitting attribute based on IG.
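
A minimal sketch of such a recursive build is shown below. It is not the program's buildTree(): it uses plain lists of dictionaries instead of a Pandas DataFrame so that it runs without dependencies, and the names build_tree, entropy, and information_gain are hypothetical. It also assumes base-2 logarithms, and that the threshold is an upper bound on the entropy of a terminal node, which is only one possible reading of "tolerance of purity".

```python
import math
from collections import Counter


def entropy(labels):
    # Entropy of a list of class labels, in bits (base 2 is an assumption).
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    # Entropy before the split minus the weighted entropy after it.
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - after


def build_tree(rows, attributes, target, threshold=0.0):
    labels = [row[target] for row in rows]
    # Stop when the node is pure enough or no attributes are left;
    # the leaf is the majority class of the rows that reached it.
    if entropy(labels) <= threshold or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target, threshold)
    return {best: branches}


# Made-up toy data for illustration only.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
tree = build_tree(rows, ["outlook", "windy"], "play")
print(tree)
```

In this sketch each internal node is a one-key dictionary mapping the splitting attribute to its branches, and each leaf is a plain class label.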

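IG(attribute) = (entropy of the dataset) - (entropy of attribute)

entropy = -sum(f * log(f))

where f = fraction of elements of each class within a group, computed for the whole dataset before splitting and for each split group after splitting; the sum runs over the classes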
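
As a worked example of the formula, the sketch below computes the entropy and IG directly from class counts. The counts are made up for illustration, the function names are hypothetical, and two things are assumed rather than taken from the program: logarithms are base 2, and the entropy of an attribute is the average of the entropies of its split groups, weighted by group size (the usual definition).

```python
import math


def entropy(counts):
    # counts: number of elements of each class in one group.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts):
    # child_counts: one list of class counts per group after the split.
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in child_counts)
    return entropy(parent_counts) - weighted


# 14 elements (9 of one class, 5 of the other), split into groups of 8 and 6.
parent = [9, 5]
children = [[6, 2], [3, 3]]
print(round(entropy(parent), 3))
print(round(information_gain(parent, children), 3))
```

Here the entropy of the dataset is about 0.940 and the IG of the split is about 0.048, so this split barely reduces the uncertainty.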

The NetworkX library is used to visualize the generated decision tree. Since the decision tree is passed on as a nested dictionary, its edges and labels are extracted separately and later plotted using the Matplotlib library.
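
The extraction step might look like the sketch below. It assumes the nested dictionary has the shape {attribute: {value: subtree or leaf label}}, which may not match the program's exact layout, and the name extract_edges is hypothetical. Nodes get integer ids because the same leaf label can appear several times in one tree.

```python
from itertools import count


def extract_edges(tree):
    # Walk the nested dictionary and collect what a plot needs:
    # a label per node, the list of edges, and a label per edge.
    ids = count()
    node_labels, edges, edge_labels = {}, [], {}

    def visit(subtree):
        node = next(ids)
        if isinstance(subtree, dict):
            # Internal node: a single key naming the splitting attribute.
            attribute, branches = next(iter(subtree.items()))
            node_labels[node] = attribute
            for value, child in branches.items():
                child_id = visit(child)
                edges.append((node, child_id))
                edge_labels[(node, child_id)] = str(value)
        else:
            # Leaf: the predicted class.
            node_labels[node] = str(subtree)
        return node

    visit(tree)
    return node_labels, edges, edge_labels


# Made-up example tree for illustration only.
tree = {"outlook": {"overcast": "yes",
                    "rainy": {"windy": {"no": "yes", "yes": "no"}},
                    "sunny": "no"}}
node_labels, edges, edge_labels = extract_edges(tree)
print(edges)
```

The edge list can then be loaded into a networkx.DiGraph with add_edges_from, and the two label dictionaries passed to networkx.draw_networkx_labels and networkx.draw_networkx_edge_labels before saving the Matplotlib figure as a PNG.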
