Decision trees are widely used for classification problems in machine learning. This program takes a CSV file and the name of the target (output) column, and produces a PNG image of the resulting decision tree.
$> pip install -r requirements.txt
How to run:
$> python3 main.py -f <csvfile> -o <output_col> [--drop <drop_cols>] [--thresh <threshold>]
csvfile = path to the CSV file containing the data (with headers),
output_col = name of the column holding the target decision variable,
drop_cols = list of columns to drop before building the tree,
threshold = purity tolerance for the terminal (leaf) nodes of the tree
The program uses entropy and information gain (IG) to choose the splitting attribute at each level of the tree: the attribute with the highest IG is selected as the node for splitting at that level.
The buildTree() function recursively generates the decision tree as a nested dictionary: at each level it computes the IG of every remaining attribute and splits on the one with the highest IG.
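The recursive scheme described above can be sketched as follows. This is a minimal illustration, not the program's actual code: the function and parameter names (`build_tree`, `info_gain`, `attrs`) are assumptions, and categorical attributes are assumed throughout.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy of the whole set minus the weighted entropy after splitting on attr."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    after = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - after

def build_tree(rows, labels, attrs):
    """Recursively build a nested-dict tree: {attribute: {value: subtree_or_leaf}}."""
    if len(set(labels)) == 1:      # pure node -> return the class as a leaf
        return labels[0]
    if not attrs:                  # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return node
```

A real implementation would also stop recursing once a node's purity reaches the `--thresh` tolerance instead of requiring perfectly pure leaves.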
IG(attribute) = (entropy of the dataset) - (weighted average entropy of the subsets produced by splitting on the attribute)
entropy = -sum(f * log2(f))
f = fraction of elements belonging to each class within the group
The NetworkX library is used to visualize the generated decision tree. Since the tree is stored as a nested dictionary, its edges and edge labels are extracted first and then plotted with the Matplotlib library.
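The edge-extraction step might look like the sketch below. It assumes the nested-dict shape `{attribute: {value: subtree_or_leaf}}` and that node names are unique (repeated attribute or class names would merge into a single graph node); the function name is illustrative, not the program's actual API.

```python
def extract_edges(tree):
    """Flatten a nested-dict decision tree into a list of (parent, child)
    edges and a {(parent, child): branch_value} map of edge labels."""
    edges, labels = [], {}
    for attr, branches in tree.items():
        for value, child in branches.items():
            if isinstance(child, dict):
                # Internal node: named after the attribute it splits on.
                child_name = next(iter(child))
                sub_edges, sub_labels = extract_edges(child)
                edges.extend(sub_edges)
                labels.update(sub_labels)
            else:
                child_name = str(child)   # leaf: the predicted class
            edges.append((attr, child_name))
            labels[(attr, child_name)] = value
    return edges, labels

# To visualize (requires networkx + matplotlib):
#   import networkx as nx, matplotlib.pyplot as plt
#   g = nx.DiGraph(edges)
#   pos = nx.spring_layout(g)
#   nx.draw(g, pos, with_labels=True)
#   nx.draw_networkx_edge_labels(g, pos, edge_labels=labels)
#   plt.savefig("tree.png")
```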