AlexWorldD/NetEmbs

bug random walk

boersmamarcel opened this issue · 20 comments

when I run

from NetEmbs.FSN import *
randomWalk(fsn, 1, length=10, direction="COMBI")

I get

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-95ca35af2a0e> in <module>()
      1 from NetEmbs.FSN import *
----> 2 randomWalk(fsn, 1, length=10, direction="COMBI")

/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py in randomWalk(G, vertex, length, direction, version, return_full_path, debug)
    255             elif version == "MetaDiff":
    256                 if direction is "COMBI":
--> 257                     new_v = step(G, cur_v, cur_direction, mode=2, return_full_step=return_full_path, debug=debug)
    258                     cur_direction = mask[cur_direction]
    259                 else:

/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py in step(G, vertex, direction, mode, allow_back, return_full_step, pressure, debug)
    146         return vertex
    147     elif not G.has_node(vertex):
--> 148         raise ValueError("Vertex {!r} is not in FSN!".format(vertex))
    149     if direction == "IN":
    150         ins = G.in_edges(vertex, data=True)

ValueError: Vertex 1 is not in FSN!

any idea how I can fix this? Do you need more info? Then please let me know.

What are your BP IDs? Your data may simply contain IDs such as {2, 3, 5, ...} and no BP with ID 1, which is why you get the error. The arguments of the function are the following:

def randomWalk(G, vertex=None, length=3, direction="IN", version="MetaDiff", return_full_path=False, debug=False):
    """
    RandomWalk function for sampling the sequence of nodes from given graph and initial node
    :param G: Bipartite graph, an instance of networkx
    :param vertex: initial node
    :param length: the maximum length of RandomWalk
    :param direction: The direction of walking. IN - go via source financial accounts, OUT - go via target financial accounts
    :param version: Version of step:
    "DefUniform" - Pure RandomWalk (uniform probabilities, follows the direction),
    "DefWeighted" - RandomWalk (weighted probabilities, follows the direction),
    "MetaUniform" - Default Metapath-version (uniform probabilities, change directions),
    "MetaWeighted" - Weighted Metapath version (weighted probabilities "rich gets richer", change directions),
    "MetaDiff" - Modified Metapath version (probabilities depend on the differences between edges, change directions)
    :param return_full_path: If True, return the full path with FA nodes
    :param debug: Debug boolean flag, print intermediate steps
    :return: Sampled sequence of nodes
    """

The set of BP nodes can be obtained with the following method of the FSN class:

fsn.get_BP()

Yes, I have many odd business process IDs; some are numbers and some are combinations of numbers and letters.

Then you need to pass one of your actual BP IDs as the vertex argument (its default is vertex=None) when sampling a sequence from fsn, something like:

from NetEmbs.FSN import *
randomWalk(fsn, "my_long_name123", length=10, direction="COMBI")
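A minimal sketch of validating the start vertex before calling randomWalk. It works on a plain list of IDs standing in for the output of fsn.get_BP(); the helper name resolve_start_vertex and the sample IDs are hypothetical and not part of NetEmbs.

```python
def resolve_start_vertex(bp_ids, vertex=None):
    """Return a start vertex that is guaranteed to be one of bp_ids."""
    bp_ids = list(bp_ids)
    if not bp_ids:
        raise ValueError("No BP nodes available")
    if vertex is None:
        # No vertex requested: fall back to the first known BP ID.
        return bp_ids[0]
    if vertex in bp_ids:
        return vertex
    # Common pitfall with mixed IDs: the integer 5 is not the string "5".
    lookalikes = [b for b in bp_ids if str(b) == str(vertex)]
    hint = " Did you mean {!r}?".format(lookalikes[0]) if lookalikes else ""
    raise ValueError("Vertex {!r} is not a BP ID.{}".format(vertex, hint))


bp_ids = [2, 3, "5", "my_long_name123"]  # hypothetical mixed int/str IDs
start = resolve_start_vertex(bp_ids, "my_long_name123")
```

The returned value could then be passed as the second argument of randomWalk, which avoids the "is not in FSN" error when IDs mix numbers and strings.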

ok, so I just need to give the first item of the list

it seems to work! :)

I did a couple of test runs and at a quick glance things look good; I think the combi strategy gives the best results because others only match input or output; the all strategy I didn’t evaluate thoroughly yet. How is building the skipgram going?!

I get the following

Fatal ValueError during step
Traceback (most recent call last):
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-4/NetEmbs/FSN/utils.py", line 203, in step
    tmp_vertex = np.random.choice(outs, p=probas)
  File "mtrand.pyx", line 1144, in mtrand.RandomState.choice
ValueError: probabilities contain NaN
Fatal ValueError during step
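A minimal sketch of how this error can arise and one way to guard against it, assuming the probabilities are obtained by dividing non-negative edge weights by their sum. The helper safe_probabilities and the node names are hypothetical; this is not the NetEmbs implementation of step.

```python
import numpy as np


def safe_probabilities(weights):
    """Normalize weights to probabilities; use uniform ones if the sum is not positive."""
    w = np.asarray(weights, dtype=float)
    w = np.where(np.isnan(w), 0.0, w)  # treat missing weights as zero
    total = w.sum()
    if total <= 0.0:
        # Plain w / total would be 0/0 = NaN here, which np.random.choice rejects.
        return np.full(w.shape, 1.0 / w.size)
    return w / total


outs = ["acc_a", "acc_b"]                # hypothetical candidate nodes
probas = safe_probabilities([0.0, 0.0])  # all-zero weights, as in a zero-amount entry
next_vertex = np.random.choice(outs, p=probas)
```

Whether a uniform fallback is appropriate depends on the data; dropping such entries during preprocessing is the alternative.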

Hi!
What are the last lines in the log file? It should record the current node and related state whenever an exception is raised.

        except Exception as e:
            if LOG:
                snapshot = {"CurrentNode": tmp_vertex, "CurrentWeight": tmp_weight,
                            "NextCandidates": list(zip(outs, ws)), "Probas": probas}
                local_logger = logging.getLogger("NetEmbs.Utils.step")
                local_logger.error("Fatal ValueError during step", exc_info=True)
                local_logger.info("Snapshot" + str(snapshot))

Currently we fill NA values inside the split_to_debit_credit() function. Might it be better to do that in a separate step?
df.fillna(0.0, inplace=True)

Yes, that's the problem: you don't split the data, so that part of the code is never executed. One moment.

Ok, at least now the input DataFrame is preprocessed with fillna() method. So, I guess the error has been fixed.

Alex

Hmm, it should be fixed already. Now it combines different colors with different markers.

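import os
# Work around the duplicate OpenMP runtime error; set before importing the plotting stack
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE


def plot_tSNE(fsn_embs, title="tSNE", rand_state=1, manual=False):
    # Project the embeddings to 2D with t-SNE and plot them, colored by FA_Name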
    tsne = TSNE(random_state=rand_state)
    embdf = pd.DataFrame(list(map(np.ravel, fsn_embs.iloc[:, 1])))
    embed_tsne = tsne.fit_transform(embdf)
    fsn_embs["x"] = pd.Series(embed_tsne[:, 0])
    fsn_embs["y"] = pd.Series(embed_tsne[:, 1])
    import seaborn as sns
    markers = ["o", "v", "s"]
    cur_m=0
    if manual:
        plt.clf()
        n_gr = 0
        for name, group in fsn_embs.groupby("FA_Name"):
            n_gr+=1
            if n_gr>3:
                cur_m = cur_m+1 if len(markers)-1>cur_m else 0
                n_gr=0
            plt.scatter(group["x"].values, group["y"].values, s=150, marker=markers[cur_m], label=name)
#         sns.scatterplot(data=fsn_embs, x="x", y="y", hue="FA_Name", s=150)
        plt.legend(bbox_to_anchor=(1.3, 1), loc="upper right", frameon=False, markerscale=1)
    else:
        fg = sns.FacetGrid(data=fsn_embs, hue='FA_Name', aspect=1.61, height=6, legend_out=True)
        fg.map(plt.scatter, 'x', 'y')
        fg.add_legend()
    if title is not None and isinstance(title, str):
        plt.tight_layout()
        plt.savefig("img/" + title, dpi=140, pad_inches=0.01)
    plt.show()
    return fsn_embs


def set_font(s, reset=False):
    if reset:
        plt.rcParams.update(plt.rcParamsDefault)
    plt.rcParams["figure.figsize"] = [20,10]
#     plt.rcParams['font.family'] = 'serif'
#     plt.rcParams['font.serif'] = ['Times New Roman'] + plt.rcParams['font.serif']
    plt.rc('font', size=s)          # controls default text sizes
    plt.rc('axes', titlesize=s)     # fontsize of the axes title
    plt.rc('axes', labelsize=s)    # fontsize of the x and y labels
    plt.rc('xtick', labelsize=s-2)    # fontsize of the tick labels
    plt.rc('ytick', labelsize=s-2)    # fontsize of the tick labels
    plt.rc('legend', fontsize=s)    # legend fontsize
    plt.rc('figure', titlesize=s)  # fontsize of the figure title

rand_seed = 2
set_font(20)
_ = plot_tSNE(res, "FastTrain10k", rand_seed, manual=True)

@AlexWorldD I tried again but I keep receiving:

Traceback (most recent call last):
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-4/NetEmbs/FSN/utils.py", line 203, in step
    tmp_vertex = np.random.choice(outs, p=probas)
  File "mtrand.pyx", line 1144, in mtrand.RandomState.choice
ValueError: probabilities contain NaN
Fatal ValueError during step

OK, that's weird. What is in the log file? It is logs.log in your project directory.

from NetEmbs.Logs.custom_logger import log_me
MAIN_LOGGER = log_me()
MAIN_LOGGER.info("Started..")

I found one such entry again and noticed the following: a single journal entry A->B in which all amounts are zero:

name  debit  credit
a     0      0
b     0      0

These are small errors in the data itself; we can filter out these transactions in the data preparation step.

That case should show up as NaNs during the normalization procedure (division by zero), so the current version of the prepare_data function should handle it:

if norm:
    original_df = normalize(original_df)
# Remove rows with NaN values after normalization (e.g. when all values were 0.0 -> something/zero leads to NaN)
original_df.dropna(subset=["Debit", "Credit"], inplace=True)
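A small sketch of the effect described above, on toy data. The "ID" column and the normalize_sketch function are assumptions made for illustration (amounts divided by the per-entry column total); the actual normalize in NetEmbs may differ.

```python
import pandas as pd

# Toy journal entries: entry 1 is regular, entry 2 has only zero amounts.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Name": ["a", "b", "a", "b"],
    "Debit": [100.0, 0.0, 0.0, 0.0],
    "Credit": [0.0, 100.0, 0.0, 0.0],
})


def normalize_sketch(df):
    """Hypothetical stand-in for normalize: divide amounts by the entry's column total."""
    out = df.copy()
    totals = out.groupby("ID")[["Debit", "Credit"]].transform("sum")
    out["Debit"] = out["Debit"] / totals["Debit"]     # 0.0 / 0.0 gives NaN for entry 2
    out["Credit"] = out["Credit"] / totals["Credit"]
    return out


normalized = normalize_sketch(df)
# Rows of the all-zero entry are NaN after normalization and are dropped here.
cleaned = normalized.dropna(subset=["Debit", "Credit"])
```

Under these assumptions only the rows of entry 1 remain in cleaned, so the all-zero entry never reaches the random walk.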

But again, it works for my test cases... Hope it'll be also OK for real data...