bug random walk
boersmamarcel opened this issue · 20 comments
when I run
from NetEmbs.FSN import *
randomWalk(fsn, 1, length=10, direction="COMBI")
I get
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-95ca35af2a0e> in <module>()
1 from NetEmbs.FSN import *
----> 2 randomWalk(fsn, 1, length=10, direction="COMBI")
/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py in randomWalk(G, vertex, length, direction, version, return_full_path, debug)
255 elif version == "MetaDiff":
256 if direction is "COMBI":
--> 257 new_v = step(G, cur_v, cur_direction, mode=2, return_full_step=return_full_path, debug=debug)
258 cur_direction = mask[cur_direction]
259 else:
/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py in step(G, vertex, direction, mode, allow_back, return_full_step, pressure, debug)
146 return vertex
147 elif not G.has_node(vertex):
--> 148 raise ValueError("Vertex {!r} is not in FSN!".format(vertex))
149 if direction == "IN":
150 ins = G.in_edges(vertex, data=True)
ValueError: Vertex 1 is not in FSN!
Any idea how I can fix this? If you need more info, please let me know.
What are your BP IDs? Your data might simply contain IDs such as {2, 3, 5, etc.} and no BP with an ID equal to 1, which is why you are getting the error. The arguments of the function are as follows:
def randomWalk(G, vertex=None, length=3, direction="IN", version="MetaDiff", return_full_path=False, debug=False):
    """
    RandomWalk function for sampling the sequence of nodes from given graph and initial node
    :param G: Bipartite graph, an instance of networkx
    :param vertex: initial node
    :param length: the maximum length of RandomWalk
    :param direction: The direction of walking. IN - go via source financial accounts, OUT - go via target financial accounts
    :param version: Version of step:
        "DefUniform" - Pure RandomWalk (uniform probabilities, follows the direction),
        "DefWeighted" - RandomWalk (weighted probabilities, follows the direction),
        "MetaUniform" - Default Metapath-version (uniform probabilities, change directions),
        "MetaWeighted" - Weighted Metapath version (weighted probabilities "rich gets richer", change directions),
        "MetaDiff" - Modified Metapath version (probabilities depend on the differences between edges, change directions)
    :param return_full_path: If True, return the full path with FA nodes
    :param debug: Debug boolean flag, print intermediate steps
    :return: Sampled sequence of nodes
    """
The set of BP nodes can be obtained with the following method of the FSN class:
fsn.get_BP()
Yes, I have many weird business process IDs; some are numbers and some are combinations of numbers and letters.
Then you need to pass one of your actual BP IDs as the vertex argument (in place of the default vertex=None) when sampling a sequence from fsn, something like:
from NetEmbs.FSN import *
randomWalk(fsn, "my_long_name123", length=10, direction="COMBI")
ok, so I just need to give the first item of the list
it seems to work! :)
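For anyone hitting the same "Vertex 1 is not in FSN!" error: below is a minimal sketch of a pre-check that could be run before calling randomWalk. It is not part of NetEmbs. The helper name resolve_start_vertex is hypothetical, and bp_ids is assumed to be whatever collection fsn.get_BP() returns. It only illustrates that the integer 1 and the string "1" are different node IDs.

```python
def resolve_start_vertex(bp_ids, vertex=None):
    """Return a valid start vertex from bp_ids, or raise with a hint.

    bp_ids is assumed to be the collection returned by fsn.get_BP();
    this helper itself is hypothetical and not part of NetEmbs.
    """
    ids = list(bp_ids)
    if not ids:
        raise ValueError("No BP nodes available")
    if vertex is None:
        # No explicit choice: fall back to the first BP ID
        return ids[0]
    if vertex in ids:
        return vertex
    # Common pitfall: the ID is stored as "1" (str) but passed as 1 (int)
    lookalikes = [v for v in ids if str(v) == str(vertex)]
    hint = " Did you mean {!r}?".format(lookalikes[0]) if lookalikes else ""
    raise ValueError("Vertex {!r} is not among the BP IDs.{}".format(vertex, hint))


# Example with made-up IDs mixing digits and letters
example_ids = ["my_long_name123", "2", "A17"]
start = resolve_start_vertex(example_ids)
```

The returned value would then be passed as the second argument of randomWalk.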
I did a couple of test runs and at a quick glance things look good; I think the combi strategy gives the best results because others only match input or output; the all strategy I didn’t evaluate thoroughly yet. How is building the skipgram going?!
I get the following
Fatal ValueError during step
Traceback (most recent call last):
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-4/NetEmbs/FSN/utils.py", line 203, in step
    tmp_vertex = np.random.choice(outs, p=probas)
  File "mtrand.pyx", line 1144, in mtrand.RandomState.choice
ValueError: probabilities contain NaN
Fatal ValueError during step
@AlexWorldD any idea?
Hi!
What are the last rows in the log file? It should record the current node etc. whenever an exception is raised.
except Exception as e:
    if LOG:
        snapshot = {"CurrentNode": tmp_vertex, "CurrentWeight": tmp_weight,
                    "NextCandidates": list(zip(outs, ws)), "Probas": probas}
        local_logger = logging.getLogger("NetEmbs.Utils.step")
        local_logger.error("Fatal ValueError during step", exc_info=True)
        local_logger.info("Snapshot" + str(snapshot))
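As a self-contained illustration of the pattern above (log the traceback plus a snapshot of the inputs, then re-raise), here is a rough sketch using only the standard library. The logger name "example.step" and the function logged_choice are made up for this example, and random.choices stands in for np.random.choice, so this is not the actual NetEmbs code.

```python
import logging
import random

logger = logging.getLogger("example.step")  # hypothetical logger name


def logged_choice(candidates, weights):
    """Pick one candidate by weight; log a snapshot of the inputs on failure."""
    try:
        total = sum(weights)
        # "not total > 0" is also True when total is NaN
        if not total > 0:
            raise ValueError("weights must sum to a positive number")
        return random.choices(candidates, weights=weights, k=1)[0]
    except ValueError:
        snapshot = {"NextCandidates": list(zip(candidates, weights))}
        # exc_info=True attaches the current traceback to the log record
        logger.error("Fatal ValueError during step", exc_info=True)
        logger.info("Snapshot " + str(snapshot))
        raise
```

With the snapshot in the log it is usually possible to see directly which candidate weights were zero or NaN.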
Currently we fill NA values inside the split_to_debit_credit() function, so it might be better to do it as a separate step?
df.fillna(0.0, inplace=True)
Yes, that's the problem. You don't split the data, so that part of the code is never executed. One moment.
OK, now the input DataFrame is at least preprocessed with the fillna() method, so I guess the error has been fixed.
Alex
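To make the failure mode concrete: "probabilities contain NaN" is raised by np.random.choice when the p argument has NaN entries, which happens for example when the weights themselves are NaN or when all-zero weights are normalized with NumPy (0/0). A minimal sketch of a guard that could be applied before sampling is shown below. The function name safe_probabilities is hypothetical and the code is plain Python, not the NetEmbs implementation.

```python
import math


def safe_probabilities(weights):
    """Normalize weights into probabilities, or return None if that is impossible.

    Returns None for empty input, for NaN/inf/negative weights, and for
    weights that sum to zero (the cases that would otherwise give NaN).
    """
    ws = [float(w) for w in weights]
    if not ws or any((not math.isfinite(w)) or w < 0 for w in ws):
        return None
    total = sum(ws)
    if total == 0.0:
        return None
    return [w / total for w in ws]


probas = safe_probabilities([1, 3])  # valid weights
bad = safe_probabilities([0, 0])     # would be 0/0 after normalization
```

A caller could skip the step, or fall back to uniform probabilities, whenever None is returned; which fallback is appropriate depends on the walk version.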
Hmm, it should be fixed already. Now it combines different colors with different markers.
import numpy as np
import pandas as pd


def plot_tSNE(fsn_embs, title="tSNE", rand_state=1, manual=False):
    import os
    # Allow duplicate OpenMP runtimes to be loaded (avoids a crash on some setups)
    os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    # Project the embeddings (second column of fsn_embs) to 2D
    tsne = TSNE(random_state=rand_state)
    embdf = pd.DataFrame(list(map(np.ravel, fsn_embs.iloc[:, 1])))
    embed_tsne = tsne.fit_transform(embdf)
    fsn_embs["x"] = pd.Series(embed_tsne[:, 0])
    fsn_embs["y"] = pd.Series(embed_tsne[:, 1])
    import seaborn as sns
    markers = ["o", "v", "s"]
    cur_m = 0
    if manual:
        plt.clf()
        n_gr = 0
        for name, group in fsn_embs.groupby("FA_Name"):
            n_gr += 1
            # Switch to the next marker after every few groups
            if n_gr > 3:
                cur_m = cur_m + 1 if len(markers) - 1 > cur_m else 0
                n_gr = 0
            plt.scatter(group["x"].values, group["y"].values, s=150, marker=markers[cur_m], label=name)
        # sns.scatterplot(data=fsn_embs, x="x", y="y", hue="FA_Name", s=150)
        plt.legend(bbox_to_anchor=(1.3, 1), loc="upper right", frameon=False, markerscale=1)
    else:
        fg = sns.FacetGrid(data=fsn_embs, hue='FA_Name', aspect=1.61, height=6, legend_out=True)
        fg.map(plt.scatter, 'x', 'y')
        fg.add_legend()
    if title is not None and isinstance(title, str):
        plt.tight_layout()
        plt.savefig("img/" + title, dpi=140, pad_inches=0.01)
    plt.show()
    return fsn_embs
import matplotlib.pyplot as plt


def set_font(s, reset=False):
    if reset:
        plt.rcParams.update(plt.rcParamsDefault)
    plt.rcParams["figure.figsize"] = [20, 10]
    # plt.rcParams['font.family'] = 'serif'
    # plt.rcParams['font.serif'] = ['Times New Roman'] + plt.rcParams['font.serif']
    plt.rc('font', size=s)  # controls default text sizes
    plt.rc('axes', titlesize=s)  # fontsize of the axes title
    plt.rc('axes', labelsize=s)  # fontsize of the x and y labels
    plt.rc('xtick', labelsize=s-2)  # fontsize of the tick labels
    plt.rc('ytick', labelsize=s-2)  # fontsize of the tick labels
    plt.rc('legend', fontsize=s)  # legend fontsize
    plt.rc('figure', titlesize=s)  # fontsize of the figure title


rand_seed = 2
set_font(20)
_ = plot_tSNE(res, "FastTrain10k", rand_seed, manual=True)
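A side note on the marker logic: the idea of combining colors with markers can also be written as a small lookup, sketched below. This is a simplified variant and not what plot_tSNE does exactly. The function assign_markers is hypothetical, and the default of 10 groups per marker assumes matplotlib's default color cycle of 10 colors, so that each (color, marker) pair stays unique for up to 30 groups.

```python
def assign_markers(group_names, markers=("o", "v", "s"), per_marker=10):
    """Map each group name to a marker, switching marker every per_marker groups.

    per_marker=10 assumes the default matplotlib color cycle has 10 colors.
    """
    return {
        name: markers[(i // per_marker) % len(markers)]
        for i, name in enumerate(group_names)
    }


# Example with made-up financial account names
names = ["FA_{}".format(i) for i in range(12)]
marker_of = assign_markers(names)
```

Inside the plotting loop one would then pass marker=marker_of[name] to plt.scatter.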
@AlexWorldD I tried again but I keep receiving:
Traceback (most recent call last):
File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-4/NetEmbs/FSN/utils.py", line 203, in step
tmp_vertex = np.random.choice(outs, p=probas)
File "mtrand.pyx", line 1144, in mtrand.RandomState.choice
ValueError: probabilities contain NaN
Fatal ValueError during step
OK, that's weird. What is in the log file (logs.log in your project directory)?
from NetEmbs.Logs.custom_logger import log_me
MAIN_LOGGER = log_me()
MAIN_LOGGER.info("Started..")
I found one such entry again and noticed the following:
a single journal entry A->B where all amounts in that entry are zero. Thus:

| name | debit | credit |
|---|---|---|
| a | 0 | 0 |
| b | 0 | 0 |
These are small errors in the data itself; we can filter out these transactions in the data preparation step.
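A rough sketch of such a filter, in plain Python rather than pandas, is given below. The key "ID" for the journal entry identifier is an assumption made for this example ("Debit" and "Credit" follow the column names used in the data), and drop_all_zero_entries is a hypothetical helper, not an existing NetEmbs function.

```python
from collections import defaultdict


def drop_all_zero_entries(rows, id_key="ID", amount_keys=("Debit", "Credit")):
    """Remove every journal entry whose lines all have zero amounts.

    rows is a list of dicts, one per journal entry line; id_key is an
    assumed name for the column that identifies the journal entry.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[row[id_key]] += sum(abs(row[k]) for k in amount_keys)
    # Keep only the lines that belong to entries with some non-zero amount
    return [row for row in rows if totals[row[id_key]] > 0.0]


rows = [
    {"ID": 1, "Name": "a", "Debit": 0.0, "Credit": 0.0},
    {"ID": 1, "Name": "b", "Debit": 0.0, "Credit": 0.0},
    {"ID": 2, "Name": "a", "Debit": 100.0, "Credit": 0.0},
    {"ID": 2, "Name": "b", "Debit": 0.0, "Credit": 100.0},
]
kept = drop_all_zero_entries(rows)
```

Here entry 1 (all zeros) is dropped and both lines of entry 2 are kept.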
That case should show up as NaNs during the normalization procedure (division by zero), hence the current version of the prepare_data function should handle it:
if norm:
    original_df = normalize(original_df)
    # Remove rows with NaN values after normalization (e.g. when all values were 0.0 -> something/zero leads to NaN)
    original_df.dropna(subset=["Debit", "Credit"], inplace=True)
But again, it works for my test cases... Hope it'll also be OK for real data...