This project aims at computing a weighted graph that (exactly or approximately) realizes a given distance matrix. For a distance matrix
Here is a brief example showing what this project aims to do:
This project is written in Python and can be run in a Python environment.
First, clone this repository to your local machine and access the main directory using the command below:
git clone https://github.com/hayamizum/CactusAlgorithm.git
cd CactusAlgorithm
To run this project, use:
python cactus.py
and follow the instructions in tutorial.
To run this project, you may require the following packages:
- sys
- numpy
- networkx
- matplotlib
- itertools
- copy
- csv
- pandas
The code files are in Code
directory and data mentioned are in Data
directory.
When running the code, you can provide a distance matrix in either csv or stdin format.
For csv mode, you need to input a distance matrix as a csv format file that satisfies the following requirements. The absolute path or relative path is needed for searching the file.
- The csv file must have
$n+1$ lines, where$n$ is the number of objects (i.e. the size of the distance matrix). - The first line of the csv file only states the number
$n$ of objects in the first entry. - In the remainder
$n$ lines of the csv file, each line states the corresponding row of the distance matrix.
For stdin mode, it is required to input the distances manually. Shown below is an example using another distance matrix. The first parameter is about the number of objects
Input n: 5
Input Distance Matrix:
0 2 1 1 1
2 0 1 1 1
1 1 0 2 2
1 1 2 0 2
1 1 2 2 0
Given the above distance matrix, the code cactus.py
yields an optimal realization that is a True
or False
).
True
We here demonstrate the code using a sample distance matrix. The sample distance matrix has been created as follows. First, we randomly generated 20 grid points in the plane as below using generate_random_integer_points.py
, and the points' coordinates are all integers for the reason that the float number will cause precision loss when it comes to add +
operations. Then, we computed their pairwise distances using L1 metric. The resulting distance matrix is found in manhattan_distances_20x20.csv
.
Given the above distance matrix as input, the code cactus.py
computes the adjacency matrix of its optimal realization and draws it using the Kamada–Kawai layout algorithm.
csv or stdin:csv
File Name:manhattan_distances_20x20.csv
The Kamada–Kawai algorithm generally works well, but as seen below, a slight modification (updated_cactus_code_for_20x20_points_and_colorcircle.py
) yields a better drawing:
HIV sequence data in fasta format are found in hiv-db_gap_strip.fasta
. Using from_fasta_to_distance_matrix.py
, you can calculate the p-distances (or Hamming distances) between the sequences, which has been saved in hiv_seq_hamming_distance.csv
. You can visualize the result by using updated_cactus_code_for_hiv_data.py
. To make the result more friendly to beginners, the figure has been annotated manually and attached with colors.