Issue with timbertrek.transform_trie_to_rules
sarah-huestis opened this issue · 5 comments
I'm trying to convert a trie calculated by the treeFARMS package into a rules JSON for timbertrek, using code suggested in another ticket (#2), but I'm getting an error. The trie has 80739 trees according to the message.
Code:
from json import dump

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from treefarms.model.threshold_guess import compute_thresholds, cut
from treefarms import TREEFARMS
from treefarms.model.model_set import ModelSetContainer
import timbertrek

X = data
# h = list(X.columns)
for met in metric_cols:
    y = df[met]
    config = {
        # Regularization penalizes trees with more leaves. We recommend setting
        # it to a relatively high value to find a sparse tree.
        "regularization": 0.02,
        # The Rashomon bound multiplier controls how large a Rashomon set you get.
        "rashomon_bound_multiplier": 0.25,
        "depth_budget": 0,
    }
    model = TREEFARMS(config)
    print('configed')
    model.fit(X, y)
    print('fitted')
    # Get the Rashomon set in a trie structure
    trie = model.model_set.to_trie()
    print('trie-d')
    # Note: this rebinds df, which the loop also reads y from
    df = model.dataset
    # Convert the trie to decision paths
    feature_names = df.columns
    decision_paths = timbertrek.transform_trie_to_rules(trie, df, feature_names=feature_names)
    # Save the decision paths in a JSON file
    with open('tree_for_' + str(met) + '.json', 'w') as f:
        dump(decision_paths, f)
Error:
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_16824/365095990.py in <module>
     26 trie,
     27 df,
---> 28 feature_names=feature_names,
     29 )
     30 # Save the decision paths in a JSON file

/opt/conda/lib/python3.7/site-packages/timbertrek/timbertrek.py in transform_trie_to_rules(trie, data_df, feature_names, feature_description)
    683 # Construct trees
    684 decision_rule_hierarchy, tree_map = get_decision_rule_hierarchy_dict(
--> 685 trie, keep_position=False
    686 )
    687 new_tree_map = get_tree_map_hierarchy(tree_map)

/opt/conda/lib/python3.7/site-packages/timbertrek/timbertrek.py in get_decision_rule_hierarchy_dict(trie, keep_position)
    483 for i in tree_map["map"]:
    484 cur_string = tree_map["map"][i][0]
--> 485 all_rules = get_decision_rules(cur_string)
    486
    487 # Iterate the set and build the hierarchy dict

/opt/conda/lib/python3.7/site-packages/timbertrek/timbertrek.py in get_decision_rules(tree_strings)
    238 cur_feature, pre_features = working_queue.popleft()
    239
--> 240 cur_string = tree_strings[i]
    241 cur_string_split = cur_string.split()
    242

IndexError: list index out of range
Hi @sarah-huestis, hmm, it is a bit unusual to have -9 and -3 in the trie strings. Is it possible to save the trie dictionary as a JSON file and share the file with us (e.g., post it as a private gist)?
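For anyone following along, saving the trie might look roughly like the sketch below. It assumes the object returned by model.model_set.to_trie() is a plain, JSON-serializable dictionary; the toy dictionary and the helper names save_trie and load_trie are hypothetical stand-ins, not part of treefarms or timbertrek.

```python
import json

# Hypothetical stand-in for the dictionary returned by
# model.model_set.to_trie(); the real keys and nesting are not shown here.
trie = {"example_key": {"nested": [1, 2, 3]}}


def save_trie(trie, path):
    """Write the trie dictionary to a JSON file at `path`."""
    with open(path, "w") as f:
        json.dump(trie, f)


def load_trie(path):
    """Read a trie dictionary back from a JSON file."""
    with open(path) as f:
        return json.load(f)
```

Loading the file back with load_trie is a quick way to confirm that nothing was lost before sharing it.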
Thanks for the quick response! File here: https://gist.github.com/sarah-huestis/74627cae7cd6ef9f036d5b2d0dfbc1aa
It seems -9 and -3 come from here (thanks for the help @zbw8388!):
TimberTrek only supports binary prediction. Is your model a multi-class classifier with labels 2 and 8? Maybe try making sure the labels are either 0 or 1.
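If the target really has only two classes (say 2 and 8), one way to get 0/1 labels before fitting is sketched below. The helper binarize_labels is a hypothetical name, not a TimberTrek or treeFARMS function, and mapping the smaller label to 0 is an arbitrary choice. It raises an error when there are more than two classes, which is the case TimberTrek does not support.

```python
def binarize_labels(labels):
    """Map a two-class label sequence onto 0 and 1.

    Returns the encoded labels and the mapping that was used.
    Raises ValueError unless exactly two distinct classes are present.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError(
            f"expected exactly 2 classes, found {len(classes)}: {classes}"
        )
    # Arbitrary convention: the smaller label becomes 0, the larger becomes 1.
    mapping = {classes[0]: 0, classes[1]: 1}
    return [mapping[v] for v in labels], mapping


y_raw = [2, 8, 8, 2, 8]
y_binary, mapping = binarize_labels(y_raw)
print(y_binary)  # [0, 1, 1, 0, 1]
print(mapping)   # {2: 0, 8: 1}
```

With a pandas Series, y.map(mapping) applies the same mapping without converting to a list.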
Oh I wasn't aware it didn't support multi-class classification! Thanks for checking it out!
Yup. Supporting multi-class is actually not too difficult—I will look into it when I have more bandwidth. A note to myself or people who want to contribute:
- Support multi-class in trie parsing
- Compute multi-class accuracies during trie parsing
- Change the +/- signs to class labels in the visualization