A package to train and generate activity-level embeddings for process mining
Install the PME package with pip
$> pip install pme
- Pandas
- Scikit-Learn
- PM4PY
- Torch
- Gensim
- KarateClub
The package has three main modules:
The first module reads datasets in .XES format, converts them to .CSV format and saves them, displays their main characteristics (number of events, unique activities, unique resources, and so on), performs holdout or cross-validation partitioning, and defines the objects that store the eventlog data for use with the rest of the package functions.
- EventlogDataset: Class storing the training, validation and test partitions of a
given dataset and its main features useful for training embeddings.
- Attributes:
- filename: Name of the dataset.
- directory: Path to the folder where the dataset is stored.
- df_train: Pandas DataFrame of the train partition of the dataset.
- df_val: Pandas DataFrame of the validation partition of the dataset.
- df_test: Pandas DataFrame of the test partition of the dataset.
- num_activities: Number of unique activities in the eventlog.
- num_resources: Number of unique resources in the eventlog.
- Constructor parameters:
- csv_path: Path to the .csv file with the full eventlog.
- cv_fold: Index of the fold to read when a cross-validation split is loaded. Default: None.
- read_test: Boolean indicating whether the test split must be read. Default: True.
- make_holdout(csv_path: str, train_size: float = 80,
val_size_from_train: float = 20, splits_path: str = None):
Create the train-val-test splits and store them.
- Parameters:
- csv_path: Full path to the CSV file with the dataset.
- train_size: Percentage of the data for the training partition (the test partition is the remaining percentage). Number between 1 and 100.
- val_size_from_train: Percentage of the training partition reserved for validation. Number between 1 and 100.
- splits_path: Full path where CSV splits will be written.
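As a rough illustration of how the two percentages interact, the sketch below computes the partition sizes for a given number of cases. The helper `holdout_sizes` is hypothetical (it is not part of the package), and both the rounding and the idea of splitting by case count are assumptions.

```python
def holdout_sizes(num_cases: int, train_size: float = 80,
                  val_size_from_train: float = 20) -> tuple[int, int, int]:
    """Return (train, val, test) sizes for a percentage-based holdout.

    Hypothetical helper: rounding down is an assumption, not the
    documented behaviour of make_holdout.
    """
    train_val = int(num_cases * train_size / 100)      # training + validation cases
    val = int(train_val * val_size_from_train / 100)   # carved out of the training part
    train = train_val - val
    test = num_cases - train_val                       # the remaining cases
    return train, val, test


print(holdout_sizes(1000))  # (640, 160, 200)
```

With the default values, validation takes 20% of the 80% training share, that is, 16% of the whole eventlog.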
- make_crossvalidation(csv_path: str, num_folds: int = 5, val_size_from_train: float = 20,
splits_path: str = None, seed: int = 21):
Create the k-fold cross-validation and store the folds.
- Parameters:
- csv_path: Full path to the CSV file with the dataset.
- num_folds: Number of folds in the cross-validation.
- val_size_from_train: Percentage of the training partition reserved for validation. Number between 1 and 100.
- splits_path: Full path where CSV splits will be written.
- seed: Seed to set the random state and get reproducibility.
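The role of `num_folds` and `seed` can be sketched with a seeded shuffle of case identifiers. This is only an illustration under assumptions: `kfold_case_ids` is a hypothetical helper, and the way the package actually assigns cases to folds is not documented here.

```python
import random


def kfold_case_ids(case_ids: list, num_folds: int = 5,
                   seed: int = 21) -> list[tuple[list, list]]:
    """Return one (train_ids, test_ids) pair per fold (hypothetical helper)."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)                        # same seed, same folds
    folds = [ids[i::num_folds] for i in range(num_folds)]   # near-equal sizes
    splits = []
    for k in range(num_folds):
        test_ids = folds[k]
        train_ids = [c for j, fold in enumerate(folds) if j != k for c in fold]
        splits.append((train_ids, test_ids))
    return splits


splits = kfold_case_ids(list(range(10)), num_folds=5)
for train_ids, test_ids in splits:
    print(len(train_ids), len(test_ids))  # 8 2 on every fold
```

Each case appears in exactly one test fold; a validation subset could then be taken from every training part, in the spirit of `val_size_from_train`.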
- get_num_cases(data: pd.DataFrame) -> int:
Get the number of execution cases in the process eventlog.
- Parameters:
- data: Pandas DataFrame with the dataset.
- Return: The number of unique cases.
- get_num_activities(data: pd.DataFrame) -> int:
Get the number of unique activities in the process eventlog.
- Parameters:
- data: Pandas DataFrame with the dataset.
- Return: The number of unique activities.
- get_num_resources(data: pd.DataFrame) -> int:
Get the number of unique resources in the process eventlog.
- Parameters:
- data: Pandas DataFrame with the dataset.
- Return: The number of unique resources.
- get_case_lens(data: pd.DataFrame) -> (int, int, int):
Get the average, maximum and minimum case length in the process eventlog.
- Parameters:
- data: Pandas DataFrame with the dataset.
- Return: The average case length, the maximum case length, and the minimum case length.
- get_num_variants(data: pd.DataFrame) -> int:
Get the number of different traces (variants) in the process eventlog.
- Parameters:
- data: Pandas DataFrame with the dataset.
- Return: Number of variants (unique sequences of activities).
- get_top_variants(data: pd.DataFrame, top: int = 5) -> dict:
Get the most frequent variants.
- Parameters:
- data: Pandas DataFrame with the dataset.
- top: Number of variants to include in the ranking.
- Return: Dictionary with the top repeated variants and their count.
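The statistics described by the functions above can be sketched on a toy eventlog. To stay dependency-free, this example uses plain `(case id, activity)` tuples instead of a Pandas DataFrame; the data and variable names are illustrative assumptions, not the package's internals.

```python
from collections import Counter, defaultdict

# Toy eventlog as (case id, activity) pairs, already ordered by time.
events = [
    ("c1", "A"), ("c1", "B"), ("c1", "C"),
    ("c2", "A"), ("c2", "C"),
    ("c3", "A"), ("c3", "B"), ("c3", "C"),
]

# Group the activities of each case into a trace.
traces = defaultdict(list)
for case_id, activity in events:
    traces[case_id].append(activity)

num_cases = len(traces)
num_activities = len({activity for _, activity in events})

lengths = [len(trace) for trace in traces.values()]
avg_len, max_len, min_len = sum(lengths) / num_cases, max(lengths), min(lengths)

# A variant is a distinct sequence of activities.
variants = Counter(tuple(trace) for trace in traces.values())
num_variants = len(variants)
top_variants = dict(variants.most_common(1))

print(num_cases, num_activities, num_variants)  # 3 3 2
print(top_variants)                             # {('A', 'B', 'C'): 2}
```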
- get_datasets_list(path: str, batch_mode: bool) -> list:
Get list of paths to datasets to be processed.
- Parameters:
- path: Path to the dataset or folder.
- batch_mode: Whether to process every dataset in the folder (batch mode) or a single dataset.
- Return: A list of paths to the datasets.
- convert_xes_to_csv(xes_path: str, use_act: bool = True, use_time: bool = True,
use_res: bool = True, csv_path: str = None) -> str:
Convert the XES file with the dataset to a CSV format file.
- Parameters:
- xes_path: Full path to the XES file.
- use_act: Boolean indicating whether to use the activity column.
- use_time: Boolean indicating whether to use the timestamp column.
- use_res: Boolean indicating whether to use the resource column.
- csv_path: Path where the CSV file will be stored.
- Return: Full path to the CSV file.
The second module contains the functions to train the different embedding models and to retrieve the generated embeddings.
- get_skipgram_embeddings(cases: list[list], win_size: int, emb_size: int,
learning_rate: float = 0.002, min_lr: float = 0.002,
ns_rate: int = 0, epochs: int = 200, batch_size: int = 32,
seed: int = 21) -> (dict, float):
Train Word2Vec embeddings with the Skip-gram method and return
a dictionary with pairs [activity identifier - embedding].
- Parameters:
- cases: List of lists, each of which contains the activities of each case.
- win_size: Size of the window context.
- emb_size: Size of the embeddings generated.
- learning_rate: The initial learning rate.
- min_lr: Learning rate will linearly drop to this value as training progresses.
- ns_rate: Integer indicating the ratio of negative samples for each positive sample. If 0, no negative sampling is used.
- epochs: Number of epochs of training.
- batch_size: Size of the mini-batches.
- seed: Seed to set the random state and get reproducibility.
- Return: Dictionary with the embeddings and the time spent on training.
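To make the meaning of `cases` and `win_size` concrete, the sketch below lists the (center, context) pairs that a Skip-gram model learns from in a single case. `skipgram_pairs` is a hypothetical helper written for illustration; it is not a function of the package and does not train anything.

```python
def skipgram_pairs(case: list, win_size: int) -> list[tuple]:
    """Return (center, context) pairs within win_size positions of each activity."""
    pairs = []
    for i, center in enumerate(case):
        lo = max(0, i - win_size)
        hi = min(len(case), i + win_size + 1)
        for j in range(lo, hi):
            if j != i:                      # an activity is not its own context
                pairs.append((center, case[j]))
    return pairs


print(skipgram_pairs(["A", "B", "C"], win_size=1))
# [('A', 'B'), ('B', 'A'), ('B', 'C'), ('C', 'B')]
```

A larger `win_size` pairs each activity with more distant neighbours in the same case.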
- get_cbow_embeddings(cases: list[list], win_size: int, emb_size: int,
learning_rate: float = 0.002, min_lr: float = 0.002,
ns_rate: int = 0, epochs: int = 200, batch_size: int = 32,
seed: int = 21) -> (dict, float):
Train Word2Vec embeddings with the CBOW method and return
a dictionary with pairs [activity identifier - embedding].
- Parameters:
- cases: List of lists, each of which contains the activities of each case.
- win_size: Size of the window context.
- emb_size: Size of the embeddings generated.
- learning_rate: The initial learning rate.
- min_lr: Learning rate will linearly drop to this value as training progresses.
- ns_rate: Integer indicating the ratio of negative samples for each positive sample. If 0, no negative sampling is used.
- epochs: Number of epochs of training.
- batch_size: Size of the mini-batches.
- seed: Seed to set the random state and get reproducibility.
- Return: Dictionary with the embeddings and the time spent on training.
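CBOW reverses the direction of Skip-gram: the surrounding activities are used together to predict the one in the middle. The sketch below builds those (context, target) examples for one case; `cbow_examples` is a hypothetical helper for illustration only.

```python
def cbow_examples(case: list, win_size: int) -> list[tuple[list, object]]:
    """Return (context activities, target activity) examples for one case."""
    examples = []
    for i, target in enumerate(case):
        left = case[max(0, i - win_size):i]      # up to win_size activities before
        right = case[i + 1:i + 1 + win_size]     # up to win_size activities after
        context = left + right
        if context:                              # skip single-activity cases
            examples.append((context, target))
    return examples


print(cbow_examples(["A", "B", "C"], win_size=1))
# [(['B'], 'A'), (['A', 'C'], 'B'), (['B'], 'C')]
```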
- get_glove_embeddings(cases: list[list], win_size: int, emb_size: int, vocab_size: int,
learning_rate: float = 0.05, epochs: int = 200, batch_size: int = 32,
seed: int = 21, use_gpu: bool = True) -> (dict, float):
Train GloVe embeddings and return a dictionary with pairs [activity identifier - embedding].
- Parameters:
- cases: List of lists, each of which contains the activities of each case.
- win_size: Size of the window context.
- emb_size: Size of the embeddings generated.
- vocab_size: Number of categories (embeddings generated).
- learning_rate: The initial learning rate.
- epochs: Number of epochs of training.
- batch_size: Size of the mini-batches.
- seed: Seed to set the random state and get reproducibility.
- use_gpu: Boolean indicating whether to use the GPU to train the embeddings.
- Return: Dictionary with the embeddings and the time spent on training.
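GloVe is trained on co-occurrence statistics rather than on individual (center, context) pairs. The sketch below counts how often two activities appear within `win_size` positions of each other, weighting each co-occurrence by the inverse of the distance as in the original GloVe formulation. `cooccurrence_counts` is a hypothetical helper; whether the package weights counts this way is an assumption.

```python
from collections import defaultdict


def cooccurrence_counts(cases: list[list], win_size: int) -> dict:
    """Symmetric, distance-weighted co-occurrence counts between activities."""
    counts = defaultdict(float)
    for case in cases:
        for i, center in enumerate(case):
            for d in range(1, win_size + 1):
                if i + d < len(case):
                    other = case[i + d]
                    counts[(center, other)] += 1.0 / d   # closer pairs weigh more
                    counts[(other, center)] += 1.0 / d   # keep the table symmetric
    return dict(counts)


counts = cooccurrence_counts([["A", "B", "C"]], win_size=2)
print(counts[("A", "B")], counts[("A", "C")])  # 1.0 0.5
```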
- get_acov_embeddings(train_cases: list[list], val_cases: list[list], win_size: int, emb_size: int,
num_categories: int, learning_rate: float = 0.05, epochs: int = 200,
batch_size: int = 32, seed: int = 21, use_gpu: bool = True) -> (dict, float):
Train ACOV embeddings and return a dictionary with pairs [activity identifier - embedding].
- Parameters:
- train_cases: List of lists, each of which contains the activities of each case in training partition.
- val_cases: List of lists, each of which contains the activities of each case in validation partition.
- win_size: Size of the window context.
- emb_size: Size of the embeddings generated.
- num_categories: Number of unique elements (embeddings generated).
- learning_rate: The initial learning rate.
- epochs: Number of epochs of training.
- batch_size: Size of the mini-batches.
- seed: Seed to set the random state and get reproducibility.
- use_gpu: Boolean indicating whether to use the GPU to train the embeddings.
- Return: Dictionary with the embeddings and the time spent on training.
- get_dwc_embeddings(train_cases: list[list], val_cases: list[list], win_size: int, emb_size: int,
num_categories: int, learning_rate: float = 0.05, epochs: int = 200,
batch_size: int = 32, seed: int = 21, use_gpu: bool = True) -> (dict, float):
Train DWC embeddings and return a dictionary with pairs [activity identifier - embedding].
- Parameters:
- train_cases: List of lists, each of which contains the activities of each case in training partition.
- val_cases: List of lists, each of which contains the activities of each case in validation partition.
- win_size: Size of the window context.
- emb_size: Size of the embeddings generated.
- num_categories: Number of unique elements (embeddings generated).
- learning_rate: The initial learning rate.
- epochs: Number of epochs of training.
- batch_size: Size of the mini-batches.
- seed: Seed to set the random state and get reproducibility.
- use_gpu: Boolean indicating whether to use the GPU to train the embeddings.
- Return: Dictionary with the embeddings and the time spent on training.
- get_deepwalk_embeddings(graph: nx.Graph, win_size: int, emb_size: int,
learning_rate: float = 0.002, epochs: int = 200, walk_number: int = 10,
walk_length: int = 10, seed: int = 21) -> (dict, float):
Train DeepWalk graph embeddings and return a dictionary with pairs [activity identifier - embedding].
- Parameters:
- graph: Networkx Graph with the structure of the process.
- win_size: Size of the window context.
- emb_size: Size of the embeddings generated.
- learning_rate: The initial learning rate.
- epochs: Number of epochs of training.
- walk_number: Number of random walks from each node.
- walk_length: Length of each random walk.
- seed: Seed to set the random state and get reproducibility.
- Return: Dictionary with the embeddings and the time spent on training.
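DeepWalk samples uniform random walks from the graph and treats them as sentences for a Skip-gram model. The sketch below shows the sampling step with `walk_number` and `walk_length`, using a plain adjacency dictionary instead of a Networkx graph to stay dependency-free. Both helpers are hypothetical, and building the graph from directly-follows relations between activities is an assumption about how the process structure could be encoded.

```python
import random


def directly_follows_graph(cases: list[list]) -> dict:
    """Undirected adjacency dict linking activities that occur consecutively."""
    adj = {}
    for case in cases:
        for a, b in zip(case, case[1:]):
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj


def random_walks(adj: dict, walk_number: int = 10, walk_length: int = 10,
                 seed: int = 21) -> list[list]:
    """Sample walk_number uniform random walks of walk_length nodes per node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walk_number):
        for start in sorted(adj):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = sorted(adj[walk[-1]])
                if not neighbours:              # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


adj = directly_follows_graph([["A", "B", "C"], ["A", "C"]])
walks = random_walks(adj, walk_number=2, walk_length=4)
print(len(walks))  # 6 walks: 2 per node, 3 nodes
```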
- get_node2vec_embeddings(graph: nx.Graph, win_size: int, emb_size: int,
learning_rate: float = 0.002, epochs: int = 200, walk_number: int = 10,
walk_length: int = 10, p: float = 1.0, q: float = 1.0, seed: int = 21) -> (dict, float):
Train Node2Vec graph embeddings and return a dictionary with pairs [activity identifier - embedding].
- Parameters:
- graph: Networkx Graph with the structure of the process.
- win_size: Size of the window context.
- emb_size: Size of the embeddings generated.
- learning_rate: The initial learning rate.
- epochs: Number of epochs of training.
- walk_number: Number of random walks from each node.
- walk_length: Length of each random walk.
- p: Return parameter; 1/p weights the transition back towards the previous node.
- q: In-out parameter; 1/q weights the transition away from the previous node.
- seed: Seed to set the random state and get reproducibility.
- Return: Dictionary with the embeddings and the time spent on training.
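The effect of `p` and `q` is easiest to see in the unnormalised transition weights of a second-order random walk, as defined for Node2Vec. The sketch below is illustrative only: `node2vec_weights` is a hypothetical helper operating on a plain adjacency dictionary, not the routine the package calls.

```python
def node2vec_weights(adj: dict, prev, curr, p: float = 1.0, q: float = 1.0) -> dict:
    """Unnormalised weights for the next step from curr, having arrived from prev."""
    weights = {}
    for nxt in adj[curr]:
        if nxt == prev:
            weights[nxt] = 1.0 / p      # go back to the previous node
        elif nxt in adj[prev]:
            weights[nxt] = 1.0          # stay at distance 1 from the previous node
        else:
            weights[nxt] = 1.0 / q      # move away from the previous node
    return weights


adj = {"A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"}, "D": {"B"}}
w = node2vec_weights(adj, prev="A", curr="B", p=2.0, q=0.5)
print(sorted(w.items()))  # [('A', 0.5), ('C', 1.0), ('D', 2.0)]
```

With `p = q = 1` all weights are equal and the walk reduces to the uniform walk used by DeepWalk.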
- get_walklets_embeddings(graph: nx.Graph, win_size: int, emb_size: int,
learning_rate: float = 0.002, epochs: int = 200, walk_number: int = 10,
walk_length: int = 10, seed: int = 21) -> (dict, float):
Train Walklets graph embeddings and return a dictionary with pairs [activity identifier - embedding].
- Parameters:
- graph: Networkx Graph with the structure of the process.
- win_size: Size of the window context.
- emb_size: Size of the embeddings generated.
- learning_rate: The initial learning rate.
- epochs: Number of epochs of training.
- walk_number: Number of random walks from each node.
- walk_length: Length of each random walk.
- seed: Seed to set the random state and get reproducibility.
- Return: Dictionary with the embeddings and the time spent on training.
- get_role2vec_embeddings(graph: nx.Graph, win_size: int, emb_size: int,
learning_rate: float = 0.002, epochs: int = 200, walk_number: int = 10,
walk_length: int = 10, seed: int = 21) -> (dict, float):
Train Role2Vec graph embeddings and return a dictionary with pairs [activity identifier - embedding].
- Parameters:
- graph: Networkx Graph with the structure of the process.
- win_size: Size of the window context.
- emb_size: Size of the embeddings generated.
- learning_rate: The initial learning rate.
- epochs: Number of epochs of training.
- walk_number: Number of random walks from each node.
- walk_length: Length of each random walk.
- seed: Seed to set the random state and get reproducibility.
- Return: Dictionary with the embeddings and the time spent on training.
- get_lapaclianeigenmaps_embeddings(graph: nx.Graph, emb_size: int,
epochs: int = 200, seed: int = 21) -> (dict, float):
Train Laplacian Eigenmaps graph embeddings and return a dictionary with pairs [activity identifier - embedding].
- Parameters:
- graph: Networkx Graph with the structure of the process.
- emb_size: Size of the embeddings generated.
- epochs: Number of epochs of training.
- seed: Seed to set the random state and get reproducibility.
- Return: Dictionary with the embeddings and the time spent on training.
- get_diff2vec_embeddings(graph: nx.Graph, win_size: int, emb_size: int,
learning_rate: float = 0.002, epochs: int = 200, diffusion_number: int = 10,
diffusion_cover: int = 10, seed: int = 21) -> (dict, float):
Train Diff2Vec graph embeddings and return a dictionary with pairs [activity identifier - embedding].
- Parameters:
- graph: Networkx Graph with the structure of the process.
- win_size: Size of the window context.
- emb_size: Size of the embeddings generated.
- learning_rate: The initial learning rate.
- epochs: Number of epochs of training.
- diffusion_number: Number of diffusions.
- diffusion_cover: Number of nodes in diffusion.
- seed: Seed to set the random state and get reproducibility.
- Return: Dictionary with the embeddings and the time spent on training.
- get_glee_embeddings(graph: nx.Graph, emb_size: int,
seed: int = 21) -> (dict, float):
Train GLEE graph embeddings and return a dictionary with pairs [activity identifier - embedding].
- Parameters:
- graph: Networkx Graph with the structure of the process.
- emb_size: Size of the embeddings generated.
- seed: Seed to set the random state and get reproducibility.
- Return: Dictionary with the embeddings and the time spent on training.
- get_nmfadmm_embeddings(graph: nx.Graph, emb_size: int,
epochs: int = 200, seed: int = 21) -> (dict, float):
Train NMF-ADMM graph embeddings and return a dictionary with pairs [activity identifier - embedding].
- Parameters:
- graph: Networkx Graph with the structure of the process.
- emb_size: Size of the embeddings generated.
- epochs: Number of epochs of training.
- seed: Seed to set the random state and get reproducibility.
- Return: Dictionary with the embeddings and the time spent on training.
- train_test_LSTMonehot(train_cases: list[list], val_cases: list[list], test_cases: list[list],
num_categories: int, learning_rate: float = 0.05, epochs: int = 200, batch_size: int = 32,
seed: int = 21, use_gpu: bool = True) -> (float, float, float):
Train and test the LSTM_onehot next-activity prediction model.
- Parameters:
- train_cases: List of lists, each of which contains the activities of each case in training partition.
- val_cases: List of lists, each of which contains the activities of each case in validation partition.
- test_cases: List of lists, each of which contains the activities of each case in testing partition.
- num_categories: Number of unique activities.
- learning_rate: The initial learning rate.
- epochs: Number of epochs of training.
- batch_size: Size of the mini-batches.
- seed: Seed to set the random state and get reproducibility.
- use_gpu: Boolean indicating whether to use the GPU to train the model.
- Return: The accuracy in test partition, the training time and the testing time.
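Next-activity prediction is usually framed as predicting each activity of a case from the prefix that precedes it. The sketch below builds those (prefix, next activity) samples and one-hot encodes the prefix, which is roughly the input a one-hot LSTM would receive. The helpers are hypothetical, activities are assumed to be encoded as integers in `range(num_categories)`, and the package's own preprocessing may differ.

```python
def one_hot(index: int, num_categories: int) -> list[int]:
    """Vector of zeros with a single 1 at the activity index."""
    vec = [0] * num_categories
    vec[index] = 1
    return vec


def prefix_targets(case: list[int]) -> list[tuple[list[int], int]]:
    """Return (prefix, next activity) pairs for every position after the first."""
    return [(case[:i], case[i]) for i in range(1, len(case))]


case = [0, 2, 1]
samples = [([one_hot(a, 3) for a in prefix], target)
           for prefix, target in prefix_targets(case)]

print(prefix_targets(case))  # [([0], 2), ([0, 2], 1)]
print(samples[1][0])         # [[1, 0, 0], [0, 0, 1]]
```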
- train_test_LSTMemblayer(train_cases: list[list], val_cases: list[list], test_cases: list[list],
num_categories: int, emb_size: int, learning_rate: float = 0.05, epochs: int = 200,
batch_size: int = 32, seed: int = 21, use_gpu: bool = True) -> (float, float, float):
Train and test the LSTM_emblayer next-activity prediction model.
- Parameters:
- train_cases: List of lists, each of which contains the activities of each case in training partition.
- val_cases: List of lists, each of which contains the activities of each case in validation partition.
- test_cases: List of lists, each of which contains the activities of each case in testing partition.
- num_categories: Number of unique activities.
- emb_size: Size of the embeddings.
- learning_rate: The initial learning rate.
- epochs: Number of epochs of training.
- batch_size: Size of the mini-batches.
- seed: Seed to set the random state and get reproducibility.
- use_gpu: Boolean indicating whether to use the GPU to train the model.
- Return: The accuracy in test partition, the training time and the testing time.
- train_test_LSTMembeddings(train_cases: list[list], val_cases: list[list], test_cases: list[list],
num_categories: int, embeddings_dict: dict, learning_rate: float = 0.05, epochs: int = 200,
batch_size: int = 32, seed: int = 21, use_gpu: bool = True) -> (float, float, float):
Train and test the LSTM_embeddings next-activity prediction model.
- Parameters:
- train_cases: List of lists, each of which contains the activities of each case in training partition.
- val_cases: List of lists, each of which contains the activities of each case in validation partition.
- test_cases: List of lists, each of which contains the activities of each case in testing partition.
- num_categories: Number of unique activities.
- embeddings_dict: Dictionary with the activities and their embeddings.
- learning_rate: The initial learning rate.
- epochs: Number of epochs of training.
- batch_size: Size of the mini-batches.
- seed: Seed to set the random state and get reproducibility.
- use_gpu: Boolean indicating whether to use the GPU to train the model.
- Return: The accuracy in test partition, the training time and the testing time.
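The `embeddings_dict` argument has the same shape as the dictionaries returned by the embedding functions: one vector per activity identifier. The sketch below shows how such a dictionary can turn a prefix of activities into a sequence of vectors before it reaches the LSTM. `embed_prefix` is a hypothetical helper, and the zero-vector fallback for unseen activities is an assumption.

```python
def embed_prefix(prefix: list, embeddings_dict: dict, emb_size: int) -> list[list[float]]:
    """Replace each activity in the prefix by its pretrained embedding."""
    unknown = [0.0] * emb_size                  # assumed fallback for unseen activities
    return [list(embeddings_dict.get(activity, unknown)) for activity in prefix]


# Toy embeddings of size 2 for three activities (made-up values).
embeddings_dict = {0: [0.1, 0.9], 1: [0.8, 0.2], 2: [0.5, 0.5]}

print(embed_prefix([0, 2], embeddings_dict, emb_size=2))  # [[0.1, 0.9], [0.5, 0.5]]
print(embed_prefix([7], embeddings_dict, emb_size=2))     # [[0.0, 0.0]]
```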