RAMEN

Primary language: Python. License: MIT.

Documentation

https://ramen20.readthedocs.io/

RAMEN Method Overview

The RAMEN method has two major components: (1) random walks, which select the variables most relevant to the COVID-19 outcomes (severity or long COVID) and build a draft candidate network; and (2) a genetic algorithm, which starts from that draft and searches for the optimized network representing the relationships between clinical variables.

Random walks: First, we calculate the mutual information between every possible pair of clinical variables (all possible edges). The mutual information values are normalized and used as the transition probabilities between clinical variables (edges). We then perform random walks of at most N steps starting from each non-terminal variable (every variable other than the COVID outcomes: severity or long COVID). A random walk stops once it reaches a destination (an absorbing terminal node: severity or long COVID) or runs out of steps. After the random walks, we count the number of visits to each edge across all successful random walks (those that reach a terminal node within N steps). Next, we randomly permute the transition probabilities between all nodes (edges) and repeat the random walks to obtain the number of visits each edge receives by chance. Third, we use these permutation results to filter out edges that are not significantly visited.

Genetic algorithm: The network formed by the remaining significant edges (and the nodes they connect) is the starting point for the genetic algorithm's search for the Bayesian network. First, we generate candidate parent networks from the candidate network obtained with the random walks. Next, we cross over those parent networks to produce offspring networks. Third, each offspring network is mutated to produce more candidate networks. Fourth, all of these candidate networks (parents, offspring, and their mutations) are scored, and the best are selected as the parents for the next generation. We repeat this 'evolution' process until convergence to obtain the final relationship network.
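The random-walk stage described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's implementation: the function names, the toy dataset, and the choice to normalize mutual information row by row are all hypothetical, and only the standard library is used.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def transition_probabilities(data):
    """Row-normalize pairwise mutual information into transition probabilities."""
    names = list(data)
    probs = {}
    for a in names:
        weights = {b: mutual_information(data[a], data[b]) for b in names if b != a}
        total = sum(weights.values())
        # A variable with zero total mutual information gets no outgoing edges.
        probs[a] = {b: w / total for b, w in weights.items()} if total > 0 else {}
    return probs


def count_edge_visits(probs, terminal, num_walks, num_steps, rng):
    """Count edge visits over walks that reach `terminal` within `num_steps`."""
    visits = Counter()
    for start in (v for v in probs if v != terminal):
        for _ in range(num_walks):
            node, path = start, []
            for _ in range(num_steps):
                if not probs[node]:
                    break
                targets = list(probs[node])
                weights = [probs[node][t] for t in targets]
                nxt = rng.choices(targets, weights=weights)[0]
                path.append((node, nxt))
                node = nxt
                if node == terminal:
                    visits.update(path)  # only successful walks are counted
                    break
    return visits


# Toy, already discretized data (hypothetical variable names).
data = {
    "age_group": [0, 0, 1, 1, 0, 1, 1, 0],
    "oxygen": [0, 0, 1, 1, 0, 1, 0, 0],
    "severity": [0, 0, 1, 1, 0, 1, 0, 1],
}
probs = transition_probabilities(data)
visits = count_edge_visits(
    probs, "severity", num_walks=200, num_steps=3, rng=random.Random(0)
)
```

Because the terminal node is absorbing, no counted edge starts at "severity". Repeating count_edge_visits on permuted transition probabilities would give the null visit counts used by the permutation test.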

[Figure: RAMEN pipeline graph]

Technical Summary

Installation

To install Ramen, go to the project environment and run:

pip install git+https://github.com/mcgilldinglab/RAMEN@main

Note that some other packages might need to be installed to run this package. If Anaconda is used, most of the dependencies will already be present.

Usage

To use Ramen, import the "Ramen" class from ramen.Ramen and initialize a Ramen object. The data should be preprocessed before it is passed to Ramen; Ramen itself only removes variables that have fewer than a threshold number of non-missing values and discretizes the data. The threshold can be adjusted through the constructor or through a field of the Ramen object. An end variable must also be set so that the random walk terminates upon reaching it. After initializing the Ramen object, run random_walk. It must be run before genetic_algorithm, because the random walk output is used to create the genetic algorithm's starting candidates. genetic_algorithm then generates the final network.
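The workflow above can be sketched as a small helper. The import path, keyword names, and default values are taken from the signatures documented in this README; the helper name run_ramen and its arguments csv_path and end_variable are hypothetical, and the sketch has not been checked against a particular release of the package.

```python
def run_ramen(csv_path, end_variable):
    """Run both RAMEN stages on a preprocessed CSV file and return the Ramen object."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from ramen.Ramen import Ramen

    model = Ramen(csv_data=csv_path, end_string=end_variable)

    # Stage 1: random walks plus permutation test; fills model.signif_edges.
    model.random_walk(num_exp=10, num_walks=50000, num_steps=7, p_value=0.05)

    # Stage 2: genetic algorithm seeded with the significant edges;
    # fills model.network with the final network.
    model.genetic_algorithm(num_candidates=10, hard_stop=100)
    return model
```

A call would look like run_ramen("my_data.csv", "severity"), where the file name and the column name are placeholders for your own preprocessed data.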

Ramen Object Fields

  • df (pandas.DataFrame): the discretized dataframe; the data must be supplied when creating the object.
  • var_ref (dictionary): dictionary mapping the real values to the discretized values, e.g. { variable: { "Yes" : 0, "No" : 1 } }.
  • end_string (string): variable indicating the termination node for random_walk. The string must represent a column in the dataframe.
  • mutual_info_array (np.array): 2D array containing the mutual information for all pairs of variables, initialized in the constructor.
  • signif_edges (list): list of all significant edges that remain after the random walk permutation test, stored in string format. This field is None until random_walk has finished.
  • network (networkx.DiGraph): graph object of the final network produced by the RAMEN method. This field is None until genetic_algorithm has finished.

Ramen Constructor

__init__( self, csv_data = None, ref_save_name = "var_val_ref.pickle", end_string = "", bad_var_threshold = 500 )

  • csv_data (string: path to a csv file): This parameter is mandatory. It is the data in csv format. Preprocessing should be done before the data is passed to Ramen. Missing values in the dataset should be either NaN or -999. Ramen discretizes the data for use in the subsequent steps.
  • end_string (string): The name of the variable under study in the dataset. An Exception is raised if it is not a variable in the dataset.
  • min_values (int): All variables with fewer than this number of non-missing values are removed from the dataframe. (This appears to correspond to bad_var_threshold in the signature above.)
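The missing-value rule can be illustrated with a short sketch. It assumes, for illustration only, that columns are plain Python lists in a dictionary rather than a pandas dataframe; the helper names and the toy columns are hypothetical.

```python
import math

MISSING_SENTINEL = -999


def is_missing(value):
    """Treat NaN and -999 as missing, following the convention described above."""
    if value == MISSING_SENTINEL:
        return True
    return isinstance(value, float) and math.isnan(value)


def drop_sparse_variables(columns, min_values):
    """Keep only the columns with at least `min_values` non-missing entries."""
    return {
        name: values
        for name, values in columns.items()
        if sum(not is_missing(v) for v in values) >= min_values
    }


columns = {
    "age": [34, 51, 67, 45],
    "crp": [float("nan"), -999, 12.5, -999],  # only one usable value
    "severity": [0, 1, 1, 0],
}
kept = drop_sparse_variables(columns, min_values=3)
```

Here "crp" is dropped because it has a single non-missing value, while "age" and "severity" are kept.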

Random Walk Method

random_walk( self, num_exp = 10, num_walks = 50000, num_steps = 7, p_value = 0.05, mode = "default" )

  • num_exp (int): Number of experiments in the random walk.
  • num_walks (int): Number of walks in one experiment of random walk.
  • num_steps (int): Number of steps per walk.
  • p_value (float): The p-value cutoff for the permutation test. Another standard cutoff is 0.01.
  • correction (string): The correction applied to the p-values. Currently "fdr" is implemented; any other value falls back to "no_correction".
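To make the p_value and correction parameters concrete, the sketch below computes an empirical permutation p-value for one edge and then applies a false discovery rate adjustment. It assumes the Benjamini-Hochberg procedure and an add-one estimate of the p-value; the README does not state which FDR method or estimator the package uses, and the function names are hypothetical.

```python
def empirical_p_value(observed, null_counts):
    """Share of permuted runs whose visit count is at least the observed one.

    Uses the add-one estimate so the p-value is never exactly zero (an assumption).
    """
    exceed = sum(1 for c in null_counts if c >= observed)
    return (exceed + 1) / (len(null_counts) + 1)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_single = empirical_p_value(90, [10, 20, 95, 30])

p_values = [0.01, 0.04, 0.03, 0.5]  # one p-value per candidate edge
adjusted = benjamini_hochberg(p_values)
kept = [i for i, p in enumerate(adjusted) if p <= 0.05]
```

Without correction, the first three edges would pass a 0.05 cutoff; after the adjustment only the first one does.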

Genetic Algorithm Method

genetic_algorithm( self, num_candidates = 10, end_thresh = 0.01, mutate_num = 100, best_cand_num = 10, bad_reprod_accept = 10, reg_factor = 0.01, hard_stop = 100 )

  • num_candidates (int): The number of starting candidates.
  • end_thresh (float): If the increase in score from one generation to the next is less than the end_thresh, then it is considered a bad generation.
  • mutate_num (int): The number of mutation children for each candidate.
  • best_cand_num (int): The number of best candidates that are kept at each generation.
  • bad_reprod_accept (int): The number of bad generations accepted before terminating. This counter is reset whenever there is a good generation.
  • reg_factor (float): The score that is deducted for each edge in the network.
  • hard_stop (int): Maximum number of iterations (generations) before terminating.
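How these parameters interact can be sketched with a toy evolution loop. This is not the package's algorithm: the fit function below is a stand-in that rewards a known set of "true" edges instead of scoring a Bayesian network, and the edge universe, the crossover and mutation operators, and the exact stopping comparison are assumptions made for illustration.

```python
import random

UNIVERSE = [
    ("age", "oxygen"), ("age", "severity"), ("oxygen", "severity"),
    ("crp", "severity"), ("crp", "age"), ("oxygen", "crp"),
]
TRUE_EDGES = frozenset(UNIVERSE[:3])  # hypothetical target network


def toy_fit(net):
    """Toy stand-in for a network score: reward true edges, punish false ones."""
    return len(net & TRUE_EDGES) - len(net - TRUE_EDGES)


def mutate(net, rng):
    """Toggle one randomly chosen edge."""
    return net ^ {rng.choice(UNIVERSE)}


def crossover(a, b, rng):
    """Keep shared edges; keep each unshared edge with probability 0.5."""
    child = set(a & b)
    for edge in UNIVERSE:
        if (edge in a) != (edge in b) and rng.random() < 0.5:
            child.add(edge)
    return frozenset(child)


def evolve(initial, *, mutate_num, best_cand_num, end_thresh,
           bad_reprod_accept, reg_factor, hard_stop, rng):
    def score(net):
        # reg_factor is deducted once per edge in the network.
        return toy_fit(net) - reg_factor * len(net)

    def select(nets):
        unique = list(dict.fromkeys(nets))  # drop duplicates, keep order
        return sorted(unique, key=score, reverse=True)[:best_cand_num]

    parents = select(initial)
    best = score(parents[0])
    bad_generations = 0
    for _ in range(hard_stop):
        offspring = [crossover(a, b, rng)
                     for a in parents for b in parents if a is not b]
        mutants = [mutate(n, rng)
                   for n in parents + offspring for _ in range(mutate_num)]
        parents = select(parents + offspring + mutants)
        new_best = score(parents[0])
        if new_best - best < end_thresh:
            bad_generations += 1  # a bad generation
            if bad_generations >= bad_reprod_accept:
                break
        else:
            bad_generations = 0  # reset on a good generation
        best = new_best
    return parents[0], best


rng = random.Random(0)
initial = [frozenset(e for e in UNIVERSE if rng.random() < 0.5) for _ in range(10)]
best_net, best_score = evolve(
    initial, mutate_num=20, best_cand_num=4, end_thresh=0.01,
    bad_reprod_accept=10, reg_factor=0.01, hard_stop=100, rng=rng,
)
```

Because the parents always compete with their offspring and mutants, the best score never decreases, and the loop ends either after bad_reprod_accept consecutive bad generations or at hard_stop.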

Other methods

pickle_signif_edges(self, filename = "significant_edges.pickle") -> save signif_edges to a pickle file.

  • filename (str): save path of pickle.

load_signif_edges_pickle(self, filename) -> load signif_edges from a pickle file

  • filename (str): the path of the pickle saving the significant edges.

pickle_final_network(self, filename) -> save the final network to a pickle file

  • filename (str): save path of pickle.

set_end_string(self, end_string) -> set the end_string

  • end_string (str): new end string

get_signif_edges(self) -> get the signif_edges

set_signif_edges(self, signif_edges) -> set the signif_edges field

  • signif_edges (list): set the significant edges to be a new list of edges.

get_var_ref(self) -> get the variable values mapping created from discretization

get_mutual_info_array(self) -> get the mutual information matrix
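The save and load helpers above are described as pickle based. The round trip below shows the same idea with the standard library only; the edge string format is hypothetical, and what the Ramen methods do internally is an assumption.

```python
import os
import pickle
import tempfile

# Hypothetical string format for edges; the real format may differ.
signif_edges = ["age -> severity", "oxygen -> severity"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "significant_edges.pickle")
    with open(path, "wb") as fh:
        pickle.dump(signif_edges, fh)  # roughly what a save helper would do
    with open(path, "rb") as fh:
        restored = pickle.load(fh)  # roughly what a load helper would do
```

Saving the significant edges this way lets the slower random walk stage be skipped on later runs. Only load pickle files from sources you trust, since unpickling can execute arbitrary code.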

Example usage

After installing the Ramen package with the command above:

Initializing Ramen object

[Screenshot: initializing the Ramen object]

Initiating Random Walk

[Screenshot: running the random walk]

Initiating Genetic Algorithm

[Screenshot: running the genetic algorithm]

Credits

This repository is developed by Yiwei Xiong and Jingtao Wang. A web app for interacting with the networks, developed by Xiaoxiao Shang, is available at http://dinglab.rimuhc.ca/pgm/. This project is supervised by Professor Jun Ding.

Tingting Chen processed the data that allowed us to test our methods. Professor Douglas D. Fraser provided an alternative dataset on which we could check our method. Professor Gregory Fonseca and Simon Rousseau provided the dataset on which the method was built, along with biological insights.