Before you begin, perform the following two steps:
For this lab, you're going implement the ID3 algorithm to build a decision tree based on a dataset.
Your assignment is to build a program that reads a dataset in ARFF format, and construct a decision tree using the ID3 algorithm.
Optionally for extra credit, you are also instructed to create a (G)UI to
query the decision tree. You are free to choose the format of this program.
Create a separate script called query
if you decide to make a program for
this. (If the script query
is not present in your submission, it will be
assumed that you didn't do this part.)
Your program must read from stdin
and output to stdout
. See
here
for instructions how to do this.
Basically, your program will be called by executing run < [problem_file]
,
where problem_file
is a file describing the problem (see below). Your program
should output the solution as text to the terminal.
IMPORTANT: A number of testcases are provided in the testcase directory.
Make sure your program gives the exact same output on the testcases. Use
autograder or the Unix diff
utility to check this yourself before submitting.
The input is given in the ARFF format. For details, see
here. For examples, see the
testcase
directory.
- All attributes will be nominal (categories), not continuous.
- The final attribute is the output field.
- All other attributes are input.
Output consists of the hierarchical tree, one node per line. A node is either an internal node (a node with one or more children), or a leaf node.
These nodes should be shown in your output as follows:
- A leaf node is an ANSWER node, and should be shown in your output as:
ANSWER: [output]
. This is saying that the result of following the path gives the answeroutput
(for exampleyes
,no
,maybe
). - An internal node is a decision node:
[attribute]: [value]
. For example:outlook: sunny
. This means that the attributeoutlook
is set tosunny
.
There are a number of extra requirements:
- All nodes need to be indented with
2 * depth
spaces. So the decision nodes at root level are indented with 0 spaces, and an leaf node at depth 10 with 20 spaces. - Decision nodes need to be ordered based on the occurrence of their value in
the ARFF file. For example, if the attribute
outlook
appears in the ARFF file as@attribute outlook {sunny, overcast, rainy}
, then the decision nodes for that attribute should appear in the ordersunny
,overcast
andrainy
.
The following example encodes a very simple dataset for the OR function:
@relation or
@attribute A {TRUE, FALSE}
@attribute B {TRUE, FALSE}
@attribute AorB {TRUE, FALSE}
@data
TRUE,TRUE,TRUE
FALSE,FALSE,FALSE
TRUE,FALSE,TRUE
FALSE,TRUE,TRUE
Output:
A: TRUE
ANSWER: TRUE
A: FALSE
B: TRUE
ANSWER: TRUE
B: FALSE
ANSWER: FALSE
More examples can be found in the testcases
directory