poloclub/timbertrek

Issue with timbertrek.transform_trie_to_rules

sarah-huestis opened this issue · 5 comments

I'm trying to convert a trie computed by the treeFARMS package into a rules JSON for timbertrek, using the code suggested in another ticket (#2), but I'm getting an error. According to the message, the trie has 80739 trees.

Code:

from json import dump

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from treefarms.model.threshold_guess import compute_thresholds, cut
from treefarms import TREEFARMS
from treefarms.model.model_set import ModelSetContainer
import timbertrek

# `data`, `df`, and `metric_cols` are assumed to be defined earlier in the notebook
X = data
# h = list(X.columns)
for met in metric_cols:
    y = df[met]
    config = {
        # Regularization penalizes trees with more leaves; a relatively high value yields a sparse tree
        "regularization": 0.02,
        # Rashomon bound multiplier indicates how large a Rashomon set you would like to get
        "rashomon_bound_multiplier": 0.25,
        "depth_budget": 0,
    }
    model = TREEFARMS(config)
    print('configed')
    model.fit(X, y)
    print('fitted')
    # Get the Rashomon set in a trie structure
    trie = model.model_set.to_trie()
    print('trie-d')
    # Use a separate name so the outer `df` (the source of y) is not overwritten
    dataset_df = model.dataset
    # Convert the trie to decision paths
    feature_names = dataset_df.columns
    decision_paths = timbertrek.transform_trie_to_rules(trie, dataset_df, feature_names=feature_names)
    # Save the decision paths in a JSON file
    with open('tree_for_' + str(met) + '.json', 'w') as f:
        dump(decision_paths, f)

Error:

IndexError Traceback (most recent call last)
/tmp/ipykernel_16824/365095990.py in <module>
26 trie,
27 df,
---> 28 feature_names=feature_names,
29 )
30 # Save the decision paths in a JSON file

/opt/conda/lib/python3.7/site-packages/timbertrek/timbertrek.py in transform_trie_to_rules(trie, data_df, feature_names, feature_description)
683 # Construct trees
684 decision_rule_hierarchy, tree_map = get_decision_rule_hierarchy_dict(
--> 685 trie, keep_position=False
686 )
687 new_tree_map = get_tree_map_hierarchy(tree_map)

/opt/conda/lib/python3.7/site-packages/timbertrek/timbertrek.py in get_decision_rule_hierarchy_dict(trie, keep_position)
483 for i in tree_map["map"]:
484 cur_string = tree_map["map"][i][0]
--> 485 all_rules = get_decision_rules(cur_string)
486
487 # Iterate the set and build the hierarchy dict

/opt/conda/lib/python3.7/site-packages/timbertrek/timbertrek.py in get_decision_rules(tree_strings)
238 cur_feature, pre_features = working_queue.popleft()
239
--> 240 cur_string = tree_strings[i]
241 cur_string_split = cur_string.split()
242

IndexError: list index out of range

Including screenshots.
[3 screenshots]

Hi @sarah-huestis, hmm, it is a bit unusual to have -9 and -3 in the trie strings. Would it be possible to save the trie dictionary as a JSON file and share it with us (e.g., post it as a private gist)?
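For anyone who needs to do the same, here is a minimal sketch of dumping a trie dictionary to JSON and reading it back. The helper names `save_trie` and `load_trie` are made up for illustration, and the sketch assumes the trie is a plain nested dict whose keys are strings or numbers; `default=str` is only a fallback for values that `json` cannot encode on its own.

```python
import json


def save_trie(trie, path):
    """Write a trie dictionary to `path` as JSON (hypothetical helper)."""
    with open(path, "w") as f:
        # default=str stringifies any value json cannot serialize natively
        json.dump(trie, f, default=str)


def load_trie(path):
    """Read a trie dictionary back from a JSON file (hypothetical helper)."""
    with open(path) as f:
        return json.load(f)
```

With the code from the issue, this would be called as `save_trie(model.model_set.to_trie(), "trie.json")`. Note that JSON turns integer keys into strings on the round trip.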

It seems -9 and -3 come from here (thanks for the help, @zbw8388!):

https://github.com/ubc-systopia/treeFarms/blob/d6d1363d1d4a4cf4bc768c98a481b6b8f484a153/treefarms/model/model_set.py#L31-L47

TimberTrek only supports binary prediction. Is your model a multi-class classifier with labels 2 and 8? Maybe try making sure the labels are either 0 or 1.
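A quick way to check the labels before calling `model.fit` might look like the sketch below. The function names `check_binary_labels` and `binarize_labels` are hypothetical (they are not part of TimberTrek or treeFARMS), and the label values are made up to mirror the ones suspected here. Collapsing to one-vs-rest is just one possible workaround, and it changes the prediction task.

```python
import numpy as np


def check_binary_labels(y):
    """Return True if every label in y is 0 or 1."""
    return set(np.unique(y).tolist()) <= {0, 1}


def binarize_labels(y, positive_label):
    """Map `positive_label` to 1 and every other label to 0 (one-vs-rest)."""
    return (np.asarray(y) == positive_label).astype(int)


# Made-up labels resembling the ones suspected in this issue
y = np.array([2, 8, 8, 2, 8])
print(check_binary_labels(y))                # False
y01 = binarize_labels(y, positive_label=8)
print(y01.tolist())                          # [0, 1, 1, 0, 1]
print(check_binary_labels(y01))              # True
```

In the loop from the issue, the check would go right after `y = df[met]`, so a non-binary metric column is caught before the trie is built.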

Oh I wasn't aware it didn't support multi-class classification! Thanks for checking it out!

Yup. Supporting multi-class is actually not too difficult—I will look into it when I have more bandwidth. A note to myself or people who want to contribute:

  1. Support multi-class in trie parsing
  2. Compute multi-class accuracies during trie parsing
  3. Change the +/- signs to class labels in the visualization