
PME (Process Mining Embeddings): A package to train and generate activity-level embeddings for process mining

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).




Install the PME package with pip

$> pip install pme


  • Pandas
  • Scikit-Learn
  • PM4PY
  • Torch
  • Gensim
  • KarateClub


The package has three main modules:


It reads datasets in .XES format, converts them to .CSV format and saves them, displays their main characteristics (number of events, unique activities, resources, etc.), performs holdout or cross-validation partitioning, and defines the objects that store the eventlog data for use with the rest of the package functions.


  • EventlogDataset: Class storing the training, validation and test partitions of a given dataset and its main features useful for training embeddings.
    • Attributes:
      • filename: Name of the dataset.
      • directory: Path to the folder where the dataset is stored.
      • df_train: Pandas DataFrame of the train partition of the dataset.
      • df_val: Pandas DataFrame of the validation partition of the dataset.
      • df_test: Pandas DataFrame of the test partition of the dataset.
      • num_activities: Number of unique activities in the eventlog.
      • num_resources: Number of unique resources in the eventlog.
    • Constructor parameters:
      • csv_path: Path to the .csv file with the full eventlog.
      • cv_fold: Number of the fold to read when loading a cross-validation split. Default: None.
      • read_test: Boolean indicating whether the test split should be read. Default: True.


  • make_holdout(csv_path: str, train_size: float = 80, val_size_from_train: float = 20, splits_path: str = None): Create the train-val-test splits and store them.
    • Parameters:
      • csv_path: Full path to the CSV file with the dataset.
      • train_size: Percentage of the data for the training partition (the test partition is the remaining percentage). Number between 1 and 100.
      • val_size_from_train: Percentage of the training partition reserved for validation. Number between 1 and 100.
      • splits_path: Full path where CSV splits will be written.
  • make_crossvalidation(csv_path: str, num_folds: int = 5, val_size_from_train: float = 20, splits_path: str = None, seed: int = 21): Create the k-fold cross-validation and store the folds.
    • Parameters:
      • csv_path: Full path to the CSV file with the dataset.
      • num_folds: Number of folds in the cross-validation.
      • val_size_from_train: Percentage of the training partition reserved for validation. Number between 1 and 100.
      • splits_path: Full path where CSV splits will be written.
      • seed: Seed to set the random state and get reproducibility.
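
The two percentages of make_holdout compose: with the defaults, 80% of the data goes to training and 20% of that share is then reserved for validation, which leaves 64% / 16% / 20% for train / validation / test. The sketch below only illustrates that arithmetic on a list of case identifiers. The helper holdout_split is hypothetical, and the random, case-level split is an assumption, not necessarily how the package partitions the log.

```python
import random


def holdout_split(case_ids, train_size=80, val_size_from_train=20, seed=21):
    """Split case identifiers into train, validation and test lists.

    Hypothetical helper: it mirrors the percentage parameters described
    above, but the shuffling strategy is an assumption.
    """
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    # Cases assigned to training before the validation share is taken out.
    n_train_full = round(len(ids) * train_size / 100)
    # Validation is a percentage of the training share, not of the whole log.
    n_val = round(n_train_full * val_size_from_train / 100)
    train_full, test = ids[:n_train_full], ids[n_train_full:]
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test


train, val, test = holdout_split(range(100))
print(len(train), len(val), len(test))  # 64 16 20
```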


  • get_num_cases(data: pd.DataFrame) -> int: Get the number of execution cases in the process eventlog.
    • Parameters:
      • data: Pandas DataFrame with the dataset.
    • Return: The number of unique cases.
  • get_num_activities(data: pd.DataFrame) -> int: Get the number of unique activities in the process eventlog.
    • Parameters:
      • data: Pandas DataFrame with the dataset.
    • Return: The number of unique activities.
  • get_num_resources(data: pd.DataFrame) -> int: Get the number of unique resources in the process eventlog.
    • Parameters:
      • data: Pandas DataFrame with the dataset.
    • Return: The number of unique resources.
  • get_case_lens(data: pd.DataFrame) -> (int, int, int): Get the average, maximum and minimum case length in the process eventlog.
    • Parameters:
      • data: Pandas DataFrame with the dataset.
    • Return: The average case length, the maximum case length and the minimum case length.
  • get_num_variants(data: pd.DataFrame) -> int: Get the number of different traces (variants) in the process eventlog.
    • Parameters:
      • data: Pandas DataFrame with the dataset.
    • Return: Number of variants (unique sequences of activities).
  • get_top_variants(data: pd.DataFrame, top: int = 5) -> dict: Get the most frequently repeated variants.
    • Parameters:
      • data: Pandas DataFrame with the dataset.
      • top: Number of variants to show in the top.
    • Return: Dictionary with the top repeated variants and their count.
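
As an illustration of what these statistics measure, the following sketch computes them on a toy log with the standard library only. The log is represented as (case id, activity) rows already ordered by time within each case. This representation is an assumption made for the example, and the package itself works on Pandas DataFrames.

```python
from collections import Counter, defaultdict

# Toy event log: (case id, activity) rows, ordered by time within each case.
events = [
    ("c1", "A"), ("c1", "B"), ("c1", "C"),
    ("c2", "A"), ("c2", "C"),
    ("c3", "A"), ("c3", "B"), ("c3", "C"),
]

# Group the activities of each case into a trace.
traces = defaultdict(list)
for case_id, activity in events:
    traces[case_id].append(activity)

num_cases = len(traces)
num_activities = len({activity for _, activity in events})

case_lens = [len(trace) for trace in traces.values()]
avg_len = sum(case_lens) / num_cases
max_len, min_len = max(case_lens), min(case_lens)

# A variant is a unique sequence of activities.
variants = Counter(tuple(trace) for trace in traces.values())
num_variants = len(variants)
top_variants = dict(variants.most_common(1))

print(num_cases, num_activities, max_len, min_len, num_variants)  # 3 3 3 2 2
print(top_variants)  # {('A', 'B', 'C'): 2}
```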


  • get_datasets_list(path: str, batch_mode: bool) -> list: Get list of paths to datasets to be processed.
    • Parameters:
      • path: Path to the dataset or folder.
      • batch_mode: If batch mode is used or only one dataset.
    • Return: A list of paths to the datasets.
  • convert_xes_to_csv(xes_path: str, use_act: bool = True, use_time: bool = True, use_res: bool = True, csv_path: str = None) -> str: Convert the XES file with the dataset to a CSV format file.
    • Parameters:
      • xes_path: Full path to the XES file.
      • use_act: Boolean indicating whether to use the activity column.
      • use_time: Boolean indicating whether to use the timestamp column.
      • use_res: Boolean indicating whether to use the resource column.
      • csv_path: Path where the CSV file will be stored.
    • Return: Full path to the CSV file.


It contains the functions to train the different embedding models and to retrieve the generated embeddings.


  • get_skipgram_embeddings(cases: list[list], win_size: int, emb_size: int, learning_rate: float = 0.002, min_lr: float = 0.002, ns_rate: int = 0, epochs: int = 200, batch_size: int = 32, seed: int = 21) -> (dict, float): Train Word2Vec embeddings using the Skip-gram method and return a dictionary with pairs [activity identifier - embedding].
    • Parameters:
      • cases: List of lists, each of which contains the activities of each case.
      • win_size: Size of the window context.
      • emb_size: Size of the embeddings generated.
      • learning_rate: The initial learning rate.
      • min_lr: Learning rate will linearly drop to this value as training progresses.
      • ns_rate: Integer indicating the ratio of negative samples for each positive sample. If 0, no negative sampling is used.
      • epochs: Number of epochs of training.
      • batch_size: Size of the mini-batches.
      • seed: Seed to set the random state and get reproducibility.
    • Return: Dictionary with the embeddings, and the time spent on training.
  • get_cbow_embeddings(cases: list[list], win_size: int, emb_size: int, learning_rate: float = 0.002, min_lr: float = 0.002, ns_rate: int = 0, epochs: int = 200, batch_size: int = 32, seed: int = 21) -> (dict, float): Train Word2Vec embeddings using the CBOW method and return a dictionary with pairs [activity identifier - embedding].
    • Parameters:
      • cases: List of lists, each of which contains the activities of each case.
      • win_size: Size of the window context.
      • emb_size: Size of the embeddings generated.
      • learning_rate: The initial learning rate.
      • min_lr: Learning rate will linearly drop to this value as training progresses.
      • ns_rate: Integer indicating the ratio of negative samples for each positive sample. If 0, no negative sampling is used.
      • epochs: Number of epochs of training.
      • batch_size: Size of the mini-batches.
      • seed: Seed to set the random state and get reproducibility.
    • Return: Dictionary with the embeddings, and the time spent on training.
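
To make the role of win_size concrete, here is a minimal sketch of how training examples can be derived from a single case. Skip-gram predicts each context activity from the target activity, while CBOW predicts the target from its whole context. Both helpers are hypothetical and only illustrate the windowing; they are not the package's training code.

```python
def skipgram_pairs(case, win_size):
    """Return (target, context) pairs for one case (hypothetical helper)."""
    pairs = []
    for i, target in enumerate(case):
        lo, hi = max(0, i - win_size), min(len(case), i + win_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, case[j]))
    return pairs


def cbow_examples(case, win_size):
    """Return (context list, target) examples for one case (hypothetical helper)."""
    examples = []
    for i, target in enumerate(case):
        # Up to win_size activities before and after the target.
        context = case[max(0, i - win_size):i] + case[i + 1:i + win_size + 1]
        examples.append((context, target))
    return examples


case = ["A", "B", "C"]
print(skipgram_pairs(case, win_size=1))
# [('A', 'B'), ('B', 'A'), ('B', 'C'), ('C', 'B')]
print(cbow_examples(case, win_size=1))
# [(['B'], 'A'), (['A', 'C'], 'B'), (['B'], 'C')]
```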


  • get_glove_embeddings(cases: list[list], win_size: int, emb_size: int, vocab_size: int, learning_rate: float = 0.05, epochs: int = 200, batch_size: int = 32, seed: int = 21, use_gpu: bool = True) -> (dict, float): Train GloVe embeddings and return a dictionary with pairs [activity identifier - embedding].
    • Parameters:
      • cases: List of lists, each of which contains the activities of each case.
      • win_size: Size of the window context.
      • emb_size: Size of the embeddings generated.
      • vocab_size: Number of categories (embeddings generated).
      • learning_rate: The initial learning rate.
      • epochs: Number of epochs of training.
      • batch_size: Size of the mini-batches.
      • seed: Seed to set the random state and get reproducibility.
      • use_gpu: Boolean indicating whether to use the GPU to train the embeddings.
    • Return: Dictionary with the embeddings, and the time spent on training.
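
GloVe is trained on co-occurrence statistics rather than on individual (target, context) pairs. The sketch below builds such statistics from a list of cases, weighting each co-occurrence by the inverse of the distance between the two activities, as in the original GloVe formulation. The helper cooccurrence_counts is hypothetical, and whether the package uses the same weighting is an assumption.

```python
from collections import defaultdict


def cooccurrence_counts(cases, win_size):
    """Symmetric, distance-weighted co-occurrence counts (hypothetical helper)."""
    counts = defaultdict(float)
    for case in cases:
        for i, target in enumerate(case):
            # Look back at most win_size positions; symmetry covers the look-ahead.
            for j in range(max(0, i - win_size), i):
                weight = 1.0 / (i - j)
                counts[(target, case[j])] += weight
                counts[(case[j], target)] += weight
    return dict(counts)


counts = cooccurrence_counts([["A", "B", "C"]], win_size=2)
print(counts[("A", "B")], counts[("A", "C")])  # 1.0 0.5
```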


  • get_acov_embeddings(train_cases: list[list], val_cases: list[list], win_size: int, emb_size: int, num_categories: int, learning_rate: float = 0.05, epochs: int = 200, batch_size: int = 32, seed: int = 21, use_gpu: bool = True) -> (dict, float): Train ACOV embeddings and return a dictionary with pairs [activity identifier - embedding].
    • Parameters:
      • train_cases: List of lists, each of which contains the activities of each case in training partition.
      • val_cases: List of lists, each of which contains the activities of each case in validation partition.
      • win_size: Size of the window context.
      • emb_size: Size of the embeddings generated.
      • num_categories: Number of unique elements (embeddings generated).
      • learning_rate: The initial learning rate.
      • epochs: Number of epochs of training.
      • batch_size: Size of the mini-batches.
      • seed: Seed to set the random state and get reproducibility.
      • use_gpu: Boolean indicating whether to use the GPU to train the embeddings.
    • Return: Dictionary with the embeddings, and the time spent on training.


  • get_dwc_embeddings(train_cases: list[list], val_cases: list[list], win_size: int, emb_size: int, num_categories: int, learning_rate: float = 0.05, epochs: int = 200, batch_size: int = 32, seed: int = 21, use_gpu: bool = True) -> (dict, float): Train DWC embeddings and return a dictionary with pairs [activity identifier - embedding].
    • Parameters:
      • train_cases: List of lists, each of which contains the activities of each case in training partition.
      • val_cases: List of lists, each of which contains the activities of each case in validation partition.
      • win_size: Size of the window context.
      • emb_size: Size of the embeddings generated.
      • num_categories: Number of unique elements (embeddings generated).
      • learning_rate: The initial learning rate.
      • epochs: Number of epochs of training.
      • batch_size: Size of the mini-batches.
      • seed: Seed to set the random state and get reproducibility.
      • use_gpu: Boolean indicating whether to use the GPU to train the embeddings.
    • Return: Dictionary with the embeddings, and the time spent on training.


  • get_deepwalk_embeddings(graph: nx.Graph, win_size: int, emb_size: int, learning_rate: float = 0.002, epochs: int = 200, walk_number: int = 10, walk_length: int = 10, seed: int = 21) -> (dict, float): Train DeepWalk graph embeddings and return a dictionary with pairs [activity identifier - embedding].
    • Parameters:
      • graph: Networkx Graph with the structure of the process.
      • win_size: Size of the window context.
      • emb_size: Size of the embeddings generated.
      • learning_rate: The initial learning rate.
      • epochs: Number of epochs of training.
      • walk_number: Number of random walks from each node.
      • walk_length: Length of each random walk.
      • seed: Seed to set the random state and get reproducibility.
    • Return: Dictionary with the embeddings, and the time spent on training.
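
The graph-based methods need a graph of the process and, for DeepWalk, a set of random walks over it. The following sketch shows one plausible way to obtain both from a list of cases: activities are linked when one directly follows the other, and walk_number uniform random walks of walk_length nodes start from every node. The directly-follows construction and both helper names are assumptions, and a plain adjacency dictionary stands in for the nx.Graph the package expects.

```python
import random


def directly_follows_graph(cases):
    """Undirected adjacency: link two activities if one directly follows the other."""
    adj = {}
    for case in cases:
        for activity in case:
            adj.setdefault(activity, set())
        for a, b in zip(case, case[1:]):
            if a != b:  # skip self-loops
                adj[a].add(b)
                adj[b].add(a)
    return adj


def random_walks(adj, walk_number=10, walk_length=10, seed=21):
    """Uniform random walks starting from every node (DeepWalk-style sampling)."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walk_number):
        for start in sorted(adj):
            walk = [start]
            # Stop early only if the current node has no neighbours.
            while len(walk) < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(sorted(adj[walk[-1]])))
            walks.append(walk)
    return walks


adj = directly_follows_graph([["A", "B", "C"], ["A", "C"]])
walks = random_walks(adj, walk_number=2, walk_length=4)
print(len(walks))  # 6
```

The resulting walks play the same role as the cases in the Word2Vec functions: each walk is treated as a sequence over which a context window of size win_size slides.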


  • get_node2vec_embeddings(graph: nx.Graph, win_size: int, emb_size: int, learning_rate: float = 0.002, epochs: int = 200, walk_number: int = 10, walk_length: int = 10, p: float = 1.0, q: float = 1.0, seed: int = 21) -> (dict, float): Train Node2Vec graph embeddings and return a dictionary with pairs [activity identifier - embedding].
    • Parameters:
      • graph: Networkx Graph with the structure of the process.
      • win_size: Size of the window context.
      • emb_size: Size of the embeddings generated.
      • learning_rate: The initial learning rate.
      • epochs: Number of epochs of training.
      • walk_number: Number of random walks from each node.
      • walk_length: Length of each random walk.
      • p: Return parameter (1/p transition probability) to move back to the previous node.
      • q: In-out parameter (1/q transition probability) to move away from previous node.
      • seed: Seed to set the random state and get reproducibility.
    • Return: Dictionary with the embeddings, and the time spent on training.
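
The effect of p and q is easiest to see in the unnormalised transition weights of a second-order walk. When the walk has just moved from prev to curr, each neighbour of curr is weighted 1/p if it is prev itself, 1 if it is also a neighbour of prev, and 1/q otherwise. The helper below is a hypothetical illustration of that rule on an adjacency dictionary, not the package's implementation.

```python
def node2vec_weights(adj, prev, curr, p=1.0, q=1.0):
    """Unnormalised weights for the next step of a walk that went prev -> curr."""
    weights = {}
    for nxt in adj[curr]:
        if nxt == prev:
            weights[nxt] = 1.0 / p  # return to the previous node
        elif nxt in adj[prev]:
            weights[nxt] = 1.0      # stay at distance 1 from the previous node
        else:
            weights[nxt] = 1.0 / q  # move away from the previous node
    return weights


adj = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}
weights = node2vec_weights(adj, prev="A", curr="B", p=2.0, q=4.0)
print(sorted(weights.items()))  # [('A', 0.5), ('C', 1.0), ('D', 0.25)]
```

With p = q = 1 all weights are equal, so the walk reduces to the uniform random walk used by DeepWalk.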


  • get_walklets_embeddings(graph: nx.Graph, win_size: int, emb_size: int, learning_rate: float = 0.002, epochs: int = 200, walk_number: int = 10, walk_length: int = 10, seed: int = 21) -> (dict, float): Train Walklets graph embeddings and return a dictionary with pairs [activity identifier - embedding].
    • Parameters:
      • graph: Networkx Graph with the structure of the process.
      • win_size: Size of the window context.
      • emb_size: Size of the embeddings generated.
      • learning_rate: The initial learning rate.
      • epochs: Number of epochs of training.
      • walk_number: Number of random walks from each node.
      • walk_length: Length of each random walk.
      • seed: Seed to set the random state and get reproducibility.
    • Return: Dictionary with the embeddings, and the time spent on training.


  • get_role2vec_embeddings(graph: nx.Graph, win_size: int, emb_size: int, learning_rate: float = 0.002, epochs: int = 200, walk_number: int = 10, walk_length: int = 10, seed: int = 21) -> (dict, float): Train Role2Vec graph embeddings and return a dictionary with pairs [activity identifier - embedding].
    • Parameters:
      • graph: Networkx Graph with the structure of the process.
      • win_size: Size of the window context.
      • emb_size: Size of the embeddings generated.
      • learning_rate: The initial learning rate.
      • epochs: Number of epochs of training.
      • walk_number: Number of random walks from each node.
      • walk_length: Length of each random walk.
      • seed: Seed to set the random state and get reproducibility.
    • Return: Dictionary with the embeddings, and the time spent on training.


  • get_lapaclianeigenmaps_embeddings(graph: nx.Graph, emb_size: int, epochs: int = 200, seed: int = 21) -> (dict, float): Train Laplacian Eigenmaps graph embeddings and return a dictionary with pairs [activity identifier - embedding].
    • Parameters:
      • graph: Networkx Graph with the structure of the process.
      • emb_size: Size of the embeddings generated.
      • epochs: Number of epochs of training.
      • seed: Seed to set the random state and get reproducibility.
    • Return: Dictionary with the embeddings, and the time spent on training.


  • get_diff2vec_embeddings(graph: nx.Graph, win_size: int, emb_size: int, learning_rate: float = 0.002, epochs: int = 200, diffusion_number: int = 10, diffusion_cover: int = 10, seed: int = 21) -> (dict, float): Train Diff2Vec graph embeddings and return a dictionary with pairs [activity identifier - embedding].
    • Parameters:
      • graph: Networkx Graph with the structure of the process.
      • win_size: Size of the window context.
      • emb_size: Size of the embeddings generated.
      • learning_rate: The initial learning rate.
      • epochs: Number of epochs of training.
      • diffusion_number: Number of diffusions.
      • diffusion_cover: Number of nodes in diffusion.
      • seed: Seed to set the random state and get reproducibility.
    • Return: Dictionary with the embeddings, and the time spent on training.


  • get_glee_embeddings(graph: nx.Graph, emb_size: int, seed: int = 21) -> (dict, float): Train GLEE graph embeddings and return a dictionary with pairs [activity identifier - embedding].
    • Parameters:
      • graph: Networkx Graph with the structure of the process.
      • emb_size: Size of the embeddings generated.
      • seed: Seed to set the random state and get reproducibility.
    • Return: Dictionary with the embeddings, and the time spent on training.


  • get_nmfadmm_embeddings(graph: nx.Graph, emb_size: int, epochs: int = 200, seed: int = 21) -> (dict, float): Train NMF-ADMM graph embeddings and return a dictionary with pairs [activity identifier - embedding].
    • Parameters:
      • graph: Networkx Graph with the structure of the process.
      • emb_size: Size of the embeddings generated.
      • epochs: Number of epochs of training.
      • seed: Seed to set the random state and get reproducibility.
    • Return: Dictionary with the embeddings, and the time spent on training.



  • train_test_LSTMonehot(train_cases: list[list], val_cases: list[list], test_cases: list[list], num_categories: int, learning_rate: float = 0.05, epochs: int = 200, batch_size: int = 32, seed: int = 21, use_gpu: bool = True) -> (float, float, float): Train and test the LSTM_onehot next activity prediction model.
    • Parameters:
      • train_cases: List of lists, each of which contains the activities of each case in training partition.
      • val_cases: List of lists, each of which contains the activities of each case in validation partition.
      • test_cases: List of lists, each of which contains the activities of each case in testing partition.
      • num_categories: Number of unique activities.
      • learning_rate: The initial learning rate.
      • epochs: Number of epochs of training.
      • batch_size: Size of the mini-batches.
      • seed: Seed to set the random state and get reproducibility.
      • use_gpu: Boolean indicating whether to use the GPU to train the model.
    • Return: The accuracy on the test partition, the training time and the testing time.
  • train_test_LSTMemblayer(train_cases: list[list], val_cases: list[list], test_cases: list[list], num_categories: int, emb_size: int, learning_rate: float = 0.05, epochs: int = 200, batch_size: int = 32, seed: int = 21, use_gpu: bool = True) -> (float, float, float): Train and test the LSTM_emblayer next activity prediction model.
    • Parameters:
      • train_cases: List of lists, each of which contains the activities of each case in training partition.
      • val_cases: List of lists, each of which contains the activities of each case in validation partition.
      • test_cases: List of lists, each of which contains the activities of each case in testing partition.
      • num_categories: Number of unique activities.
      • emb_size: Size of the embeddings.
      • learning_rate: The initial learning rate.
      • epochs: Number of epochs of training.
      • batch_size: Size of the mini-batches.
      • seed: Seed to set the random state and get reproducibility.
      • use_gpu: Boolean indicating whether to use the GPU to train the model.
    • Return: The accuracy on the test partition, the training time and the testing time.
  • train_test_LSTMembeddings(train_cases: list[list], val_cases: list[list], test_cases: list[list], num_categories: int, embeddings_dict: dict, learning_rate: float = 0.05, epochs: int = 200, batch_size: int = 32, seed: int = 21, use_gpu: bool = True) -> (float, float, float): Train and test the LSTM_embeddings next activity prediction model.
    • Parameters:
      • train_cases: List of lists, each of which contains the activities of each case in training partition.
      • val_cases: List of lists, each of which contains the activities of each case in validation partition.
      • test_cases: List of lists, each of which contains the activities of each case in testing partition.
      • num_categories: Number of unique activities.
      • embeddings_dict: Dictionary with the activities and their embeddings.
      • learning_rate: The initial learning rate.
      • epochs: Number of epochs of training.
      • batch_size: Size of the mini-batches.
      • seed: Seed to set the random state and get reproducibility.
      • use_gpu: Boolean indicating whether to use the GPU to train the model.
    • Return: The accuracy on the test partition, the training time and the testing time.
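
Next activity prediction is commonly framed as follows: every prefix of a case becomes an input and the activity that follows it becomes the label. The sketch below shows that framing together with the one-hot encoding that a model such as LSTM_onehot would consume. It assumes activities are already encoded as integers in the range 0 to num_categories - 1, and both helpers are hypothetical, so the package may build its samples differently.

```python
def prefix_target_pairs(case):
    """Return (prefix, next activity) samples for one case (hypothetical helper)."""
    return [(case[:i], case[i]) for i in range(1, len(case))]


def one_hot(index, num_categories):
    """One-hot vector of length num_categories with a 1 at position index."""
    vec = [0] * num_categories
    vec[index] = 1
    return vec


case = [0, 2, 1]  # activities encoded as integers
pairs = prefix_target_pairs(case)
print(pairs)  # [([0], 2), ([0, 2], 1)]

# Encode the longest prefix as a sequence of one-hot vectors.
encoded = [one_hot(activity, num_categories=3) for activity in pairs[-1][0]]
print(encoded)  # [[1, 0, 0], [0, 0, 1]]
```

The embedding-based variants replace the one-hot vector of each activity with a dense vector, either learned by an embedding layer (LSTM_emblayer) or looked up in a pre-trained embeddings_dict (LSTM_embeddings).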