Trying to use HybridIndexer for Label Indexing, run into issue where TrieWrapper has no attribute '_sorted'
preetbawa opened this issue · 14 comments
Description
I am trying to use the XLinear model for an autocomplete suggestion model. A trie plus hierarchical clustering makes sense for our use case, so we are using the HybridIndexer method, but it runs into an error while building the clusters.
How to Reproduce?
For compliance reasons I can't share the data, but the idea is to create a pandas DataFrame with 3 columns:
a) previous_query, prefix, and next_query (next_query is the label). It is easy to create a dummy pandas DataFrame with this data.
b) Here search_session_training_set_sorted is the pandas DataFrame with "previous_query", "prefix", and "next_query" columns, sorted by the label column "next_query".
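For anyone reproducing this without the original data, a dummy frame with the three columns described above could be built roughly as follows. This is only a sketch: the rows are made up for illustration. The last lines check that the columns produced by pd.get_dummies come out in the same order as the sorted unique labels, which is what the steps below rely on.

```python
import pandas as pd

# Made-up rows for illustration only: (previous_query, prefix, next_query).
rows = [
    ("shoes", "ru", "running shoes"),
    ("laptop", "la", "laptop bag"),
    ("shoes", "so", "socks"),
    ("phone", "ph", "phone case"),
    ("laptop", "la", "laptop bag"),
]
df = pd.DataFrame(rows, columns=["previous_query", "prefix", "next_query"])

# Sort by the label column, as described in the issue.
search_session_training_set_sorted = df.sort_values("next_query").reset_index(drop=True)

# One-hot encode the labels and collect the sorted unique label strings.
ohe = pd.get_dummies(search_session_training_set_sorted["next_query"])
labels_unique_sorted = sorted(set(search_session_training_set_sorted["next_query"]))

# Column j of the one-hot matrix should correspond to labels_unique_sorted[j].
columns_match = list(ohe.columns) == labels_unique_sorted
```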
Build one hot encoded label matrix wrapped in scipy csr matrix.
label_y_ohe_matrix = csr_matrix(pd.get_dummies(search_session_training_set_sorted["next_query"]).values).astype(np.float32)
Build unique set of label_strs sorted for Trie part of Indexing.
labels_unique = set(search_session_training_set_sorted["next_query"].values.flatten())
labels_unique_sorted = sorted(labels_unique)
Build prefix tf-idf position weighted char level vectorizer and get actual tfidf vectors for each prefix.
input_x_prefix_list = search_session_training_set_sorted["prefix"].tolist()
tf_idf_prefix_vectorizer = PositionProductTfidf(analyzer="char", ngram_range=(1,2), dtype=np.float32, strip_accents="unicode")
input_prefix_matrix = tf_idf_prefix_vectorizer.fit_transform(input_x_prefix_list)
Build previous_query tf-idf word level vectorizer and get back tf-idf vectors for all previous_query terms.
input_prev_query_list = search_session_training_set_sorted["previous_query"].tolist()
tf_idf_prev_query_vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1,1), dtype=np.float32, strip_accents="unicode")
input_prev_query_matrix = tf_idf_prev_query_vectorizer.fit_transform(input_prev_query_list)
Horizontally stack the previous-query and prefix matrices into one input feature matrix (CSR format).
input_feature_matrix = normalize(smat.hstack([input_prev_query_matrix, input_prefix_matrix]), "l2", axis=1)
Build label features using PIFA Embedding method.
label_features = csr_matrix(LabelEmbeddingFactory.create(
label_y_ohe_matrix,
input_feature_matrix,
method="pifa"), dtype=sp.float32)
Do label indexing using HybridIndexer strategy
cluster_matrix = HybridIndexer.gen(feat_mat=label_features,
label_strs=labels_unique_sorted,
depth=2,
max_leaf_size=100,
seed=0,
max_iter=20,
spherical_clustering=True
)
The last command above generates the following error:
07/11/2023 16:54:16 - INFO - py4j.java_gateway - Received command c on object id p1
07/11/2023 16:54:16 - INFO - main - Starting Hybrid-Trie Indexing
07/11/2023 16:54:16 - INFO - main - Added all labels to trie. Now building trie till depth = 2
in build_cluster_chain(self, depth)
79 def build_cluster_chain(self, depth):
80
---> 81 cluster_chain = self._build_sparse_cluster_chain_helper(depth=depth)
82
83 assert len(cluster_chain) == depth + 1
in _build_sparse_cluster_chain_helper(self, depth)
162 par_child_smat = smat.coo_matrix(np.ones((self.n_children, 1)))
163
--> 164 for child_char, child_trie in self.get_children():
165 child_cluster_chain = child_trie._build_sparse_cluster_chain_helper(depth=depth - 1)
166 all_cluster_chains += [child_cluster_chain]
in get_children(self)
29 child_trie._root = child_root
30 assert isinstance(child_trie._root, pygtrie._Node)
---> 31 child_trie._sorted = self._sorted
32 yield child_char, child_trie
33 elif isinstance(self._root.children, pygtrie._OneChild):
AttributeError: 'TrieWrapper' object has no attribute '_sorted'
Environment
- Operating system: Databricks Cluster Version 11.3 LTS ML
- Python version: 3.9
- PECOS version: mainline branch
I would appreciate feedback on this, as we are blocked from using HybridIndexer. From what I can tell, this attribute is not really used. The code in question is in the PECOS examples path, examples/qp2q/models/indices.py.
The bug is a result of a pygtrie version mismatch. This code used version 2.4.2 (as indicated in the requirements file), but newer versions of pygtrie (from 2.5.0 onwards) introduced a small change in the base Trie class in the pygtrie package.
In version 2.4.2, the Trie class has an attribute _sorted. (This _sorted variable controls whether the Trie children nodes are iterated in sorted order or not.)
In version 2.5.0, this has been replaced with self._iteritems, which points to a function that returns a sorted/unsorted list of items.
So there can be two solutions:
- Use pygtrie version 2.4.2.
- Replace child_trie._sorted = self._sorted with child_trie.enable_sorting(self._iteritems is self._ITERITEMS_CALLBACKS[1]) in line 43 and line 50 of https://github.com/amzn/pecos/blob/mainline/examples/qp2q/models/indices.py.
Hope this will help resolve the issue!
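If the same code has to run against both pygtrie layouts, the two cases described above could be folded into one helper. The following is only a sketch: copy_sort_setting is a hypothetical name, and OldStyleTrie / NewStyleTrie are small stand-ins that mimic the attributes named in this thread (they are not pygtrie classes), so the snippet runs without pygtrie installed.

```python
class OldStyleTrie:
    """Stand-in for the 2.4.2 layout: a plain _sorted flag."""

    def __init__(self, is_sorted=False):
        self._sorted = is_sorted


class NewStyleTrie:
    """Stand-in for the 2.5.0+ layout: _iteritems points at a callback."""

    _ITERITEMS_CALLBACKS = (
        lambda d: iter(d.items()),   # unsorted iteration
        lambda d: sorted(d.items()), # sorted iteration
    )

    def __init__(self, is_sorted=False):
        self.enable_sorting(is_sorted)

    def enable_sorting(self, enable=True):
        self._iteritems = self._ITERITEMS_CALLBACKS[bool(enable)]


def copy_sort_setting(parent, child):
    """Copy the parent's child-iteration order setting onto child."""
    if hasattr(parent, "_sorted"):
        # Older layout: the flag can be copied directly.
        child._sorted = parent._sorted
    else:
        # Newer layout: recover the flag from the active callback.
        is_sorted = parent._iteritems is parent._ITERITEMS_CALLBACKS[1]
        child.enable_sorting(is_sorted)
```

In indices.py this would stand in for the child_trie._sorted = self._sorted assignments, called as copy_sort_setting(self, child_trie).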
Thanks Nitin for your response. What I did to bypass this before your response was the following:
Add an __init__ method in TrieWrapper:
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self._sorted = None
Then, wherever child_trie._sorted was being assigned in the code, I just hardcoded it to True. I am curious what the impact of traversing children in sorted order or not is, especially since we are building an autocomplete solution.
Irrespective of hardcoding _sorted, I will try out your suggestion. Thanks so much.
Nitin, I have another question. Once we build clusters using hybrid indexing, how can I visualize the hierarchical clusters? I want to see which label embeddings are in the same cluster. Do the label strs also go into those clusters, and how can I compare which set of labels ends up in the same cluster?
Second question: I am trying to follow the example in the code path examples/qp2q/models/pecosq2q.py, and I am not sure why this if/else logic is there.
I initially build a one-hot encoding of the labels and then convert it to a CSR matrix, which is Y here for us. Then I am trying to use the PIFA embedding with the input feature matrix as X.
Do I need to do this part, "y[y > 0] = 1."?
Lines 514-519:
if self.weighted_pifa:
    label_features = LabelEmbeddingFactory.pifa(X=X, Y=y)
    y[y > 0] = 1
else:
    y[y > 0] = 1.
    label_features = LabelEmbeddingFactory.pifa(X=X, Y=y)
thanks
i am curious what's impact of traversing children in sorted order or not , especially we are building autocomplete solution as well
I think the default value of the _sorted variable is False in v2.4.2. Setting _sorted to True or False should not make a difference in this code because the query strings are sorted (see line 200) before inserting them into the trie, so both sorted and unsorted order of child nodes should be the same.
I think it is important to have a consistent order for iterating over trie nodes so that columns of final cluster matrix correspond to the right query.
how can i compare what set of labels end up in same cluster
See Line 26 for more details.
The d-th cluster matrix is of shape n_{d+1} x n_{d}. If the (i,j)-entry is non-zero, it means that node i in level d+1 is a child node of node j in level d. Each row contains exactly one non-zero entry, as each node can have exactly one parent.
So, to find which nodes are in cluster j at a given level, look at all rows that have non-zero entries in column j of the corresponding cluster matrix.
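The column lookup described above can be sketched with a toy matrix. Everything here is illustrative: the 5 x 2 matrix, the label strings, and the helper name members_of_cluster are made up, and mapping row indices back to label strings assumes that the rows of the last cluster matrix follow the order of label_strs.

```python
import numpy as np
import scipy.sparse as smat

# Toy cluster matrix: 5 child nodes (rows) assigned to 2 parent nodes (columns).
# Each row has exactly one non-zero entry, as described above.
rows = np.array([0, 1, 2, 3, 4])
cols = np.array([0, 0, 1, 1, 1])
vals = np.ones(5, dtype=np.float32)
C = smat.csr_matrix((vals, (rows, cols)), shape=(5, 2))


def members_of_cluster(cluster_mat, j):
    """Return the row indices that have a non-zero entry in column j."""
    column = cluster_mat.tocsc()[:, j]
    return sorted(column.nonzero()[0].tolist())


# Hypothetical label strings, assumed to be in the same order as the rows of C.
labels = ["apple", "apply", "banana", "band", "bank"]
cluster_1_labels = [labels[i] for i in members_of_cluster(C, 1)]
```

Looping j over range(C.shape[1]) lists every cluster at that level. For intermediate levels the rows are cluster nodes rather than labels, so only the last matrix in the chain maps directly to label strings.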
line 514 - 519
y is a count vector storing the number of times a label occurred with datapoint x.
For weighted_pifa, feature vectors are computed using a count-weighted aggregation. Once the label features are computed, all the count information is overwritten and y just contains 0/1.
If an unweighted average is required, then the count information is removed from y by converting it to a 0/1 vector before computing label_features. See the paper for more details on the pifa method for computing label_features.
If y is already a 0/1 vector, then this if-else will not make any difference in your use case.
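To make the difference concrete, here is a simplified sketch of count-weighted versus unweighted aggregation. It is not the PECOS implementation: pifa_sketch is a hypothetical stand-in that sums the feature rows of each label's instances (weighted by the entries of Y) and L2-normalizes the result, and the tiny X and Y are made up.

```python
import numpy as np


def pifa_sketch(X, Y):
    """Aggregate instance features per label (weighted by Y), then L2-normalize rows."""
    agg = Y.T @ X
    norms = np.linalg.norm(agg, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for labels with no instances
    return agg / norms


# Two datapoints with 2 features each, and a single label.
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
# Counts: the label occurred 3 times with x0 and once with x1.
Y_counts = np.array([[3.0],
                     [1.0]])

# Weighted: counts are used for aggregation and only binarized afterwards.
weighted = pifa_sketch(X, Y_counts)

# Unweighted: counts are binarized first, so both datapoints contribute equally.
Y_binary = (Y_counts > 0).astype(np.float64)
unweighted = pifa_sketch(X, Y_binary)
```

With counts, the label vector leans towards x0. After binarizing, both datapoints contribute equally. When Y is already 0/1, the two paths give the same result, which matches the last point above.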
One more question, about saving the model. I was able to train the model successfully (at least with no shape errors or other errors).
This is the code snippet I used:
from pecos.xmc.xlinear.model import XLinearModel
xlinear_model = XLinearModel.train(
    input_feature_matrix,
    labels_y_ohe_matrix,
    cluster_matrix,
    threads=16,
    Cp=1.0,
    Cn=1.0,
    threshold=0.1
)
xlinear_model.save("/dbfs/FileStore/pzn_ai/contextualized_autocomplete/model/")
But when I try to save the model to disk as shown above, I get the following error:
INFO - pecos.xmc.base - Training Layer 0 of 3 Layers in HierarchicalMLModel, neg_mining=tfn..
07/17/2023 17:22:53 - INFO - py4j.java_gateway - Received command c on object id p0
07/17/2023 17:22:53 - INFO - pecos.xmc.base - Training Layer 1 of 3 Layers in HierarchicalMLModel, neg_mining=tfn..
07/17/2023 17:22:53 - INFO - pecos.xmc.base - Training Layer 2 of 3 Layers in HierarchicalMLModel, neg_mining=tfn..
07/17/2023 17:22:54 - INFO - py4j.java_gateway - Received command c on object id p0
OSError: [Errno 95] Operation not supported
stack trace:
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/utils/smat_util.py in save_matrix(tgt, mat)
95 elif isinstance(mat, smat.spmatrix):
---> 96 smat.save_npz(tgt_file, mat, compressed=False)
97 else:
/databricks/python/lib/python3.9/site-packages/scipy/sparse/_matrix_io.py in save_npz(file, matrix, compressed)
71 else:
---> 72 np.savez(file, **arrays_dict)
73
<array_function internals> in savez(*args, **kwargs)
/databricks/python/lib/python3.9/site-packages/numpy/lib/npyio.py in savez(file, *args, **kwds)
616 """
--> 617 _savez(file, args, kwds, False)
618
/databricks/python/lib/python3.9/site-packages/numpy/lib/npyio.py in _savez(file, args, kwds, compress, allow_pickle, pickle_kwargs)
719 with zipf.open(fname, 'w', force_zip64=True) as fid:
--> 720 format.write_array(fid, val,
721 allow_pickle=allow_pickle,
/usr/lib/python3.9/zipfile.py in close(self)
1169 self._fileobj.write(self._zinfo.FileHeader(self._zip64))
-> 1170 self._fileobj.seek(self._zipfile.start_dir)
1171
OSError: [Errno 95] Operation not supported
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
OSError: [Errno 95] Operation not supported
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
in <cell line: 15>()
13 )
14
---> 15 xlinear_model.save("/dbfs/FileStore/pzn_ai/contextualized_autocomplete/model/")
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/xmc/xlinear/model.py in save(self, model_folder)
101 with open(f"{model_folder}/param.json", "w", encoding="utf-8") as fout:
102 fout.write(json.dumps(param, indent=True))
--> 103 self.model.save(path.join(model_folder, "ranker"))
104
105 @classmethod
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/xmc/base.py in save(self, folder)
1317 for d in range(self.depth):
1318 local_folder = f"{folder}/{d}.model"
-> 1319 self.model_chain[d].save(local_folder)
1320
1321 @classmethod
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/xmc/base.py in save(self, folder)
789 with open("{}/param.json".format(folder), "w") as f:
790 f.write(json.dumps(param, indent=True))
--> 791 smat_util.save_matrix("{}/W.npz".format(folder), self.W)
792 smat_util.save_matrix("{}/C.npz".format(folder), self.C)
793
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/utils/smat_util.py in save_matrix(tgt, mat)
96 smat.save_npz(tgt_file, mat, compressed=False)
97 else:
---> 98 raise NotImplementedError("Save not implemented for matrix type {}".format(type(mat)))
99
100
OSError: [Errno 95] Operation not supported
I wonder if the above error is again related to some versioning problem from running on Databricks with a different Python version, etc.
Does it say what is the type of the matrix? Perhaps @rofuyu or @OctoberChang might be able to help with this as it looks like an issue with core pecos functionality?
Let me check why I don't see the error where it says 'this type not supported' and describes the type. The code with raise NotImplementedError is there, but it doesn't show up in the logging in Databricks.
The matrices W and C under each model_chain element are both sparse CSR matrices. Why is it having issues? Is some other member causing the problem?
@rofuyu or @OctoberChang, can you please shed some light on this issue? It is blocking us from saving the model to disk.
@preetbawa , I know that you checked that the matrices W and C are sparse_csr matrices but can you share more details about the exact exception being raised here? What exactly does the exception message from Line 98 say? This error would not be raised if the matrix being saved was a scipy sparse_csr matrix.
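One thing worth noting from the traceback above: the innermost failing frame is the seek call in zipfile.py (line 1170), which points at the target file system rather than at the matrix type. If the /dbfs mount does not support random writes (an assumption, not confirmed in this thread), one possible workaround is to save to local disk first and then copy the finished folder over. A minimal sketch, where save_via_local_dir is a hypothetical helper and not part of PECOS:

```python
import shutil
import tempfile
from pathlib import Path


def save_via_local_dir(save_fn, final_dir):
    """Run save_fn on a local temp folder, then copy the result to final_dir."""
    with tempfile.TemporaryDirectory() as tmp:
        local_dir = Path(tmp) / "model"
        local_dir.mkdir()
        # save_fn is any callable taking a folder path, e.g. xlinear_model.save
        save_fn(str(local_dir))
        # Copy the finished files over; this only needs sequential writes.
        shutil.copytree(local_dir, final_dir, dirs_exist_ok=True)


# Possible usage, assuming xlinear_model is the trained model from above:
# save_via_local_dir(
#     xlinear_model.save,
#     "/dbfs/FileStore/pzn_ai/contextualized_autocomplete/model/",
# )
```

If the save succeeds on local disk but the original call still fails on /dbfs, that would support the file-system explanation over a matrix-type problem.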