This repository is my attempt at the selection test for the project Human history and data visualization for GSoC 2019.
It contains main.py and utils.py. utils.py has all the functions, including those responsible for applying the PCA and UMAP algorithms to the dataset. main.py contains the driver code, which calls these functions and takes its parameters from command-line arguments. More information is in the How to Run section.
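As a rough illustration of the dimensionality-reduction step, the sketch below projects a samples-by-features matrix onto its first two principal components. It is not the code in utils.py: it uses a plain NumPy SVD instead of the sklearn and umap calls the repository relies on, and the function name pca_2d is hypothetical.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto the first two principal components.

    Hypothetical helper for illustration only; it is not taken from utils.py.
    """
    X = np.asarray(X, dtype=float)
    # PCA works on mean-centred data
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Coordinates of every sample along the first two axes
    return centered @ vt[:2].T
```

Each row of the result is the 2-D coordinate of one sample, which is the kind of data that ends up in the plots and the coordinate files.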
Note: The number enclosed in [.] in the filenames of the plots, the titles of the plots and the .txt files holding the 2-D coordinates is the unique ID and can be mapped directly to the log.
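To look up a result in the log, the ID has to be pulled out of a filename first. The sketch below shows one way to do that, assuming the ID is a run of digits between square brackets; the example filenames and the function names are hypothetical and may not match the files this repository produces.

```python
import re
from pathlib import Path

# Assumption: the unique ID is a run of digits enclosed in square brackets
ID_PATTERN = re.compile(r"\[(\d+)\]")


def extract_id(filename):
    """Return the bracketed ID in a filename as a string, or None if absent."""
    match = ID_PATTERN.search(Path(filename).name)
    return match.group(1) if match else None


def group_by_id(filenames):
    """Group filenames by their bracketed ID, skipping names without one."""
    groups = {}
    for name in filenames:
        file_id = extract_id(name)
        if file_id is not None:
            groups.setdefault(file_id, []).append(name)
    return groups
```

Grouping by ID pairs each plot with the coordinate file that shares its ID.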
The plots directory contains the PCA and UMAP plots.
The reduced_dim_coordinates directory contains the .txt files which have the 2-D coordinates.
The data directory contains the data files provided. Since the gene data is very large and my laptop is unable to handle it, I have run the code on a small subset of the dataset, data_sample.csv. I will upload the complete results once I manage to run the complete dataset.
To run the code, the following libraries are required (with the versions I used):
- Pandas - 0.23.4
- NumPy - 1.14.2
- sklearn - 0.20.0
- umap - 0.3.5
- Seaborn - 0.9.0
- Matplotlib - 3.0.0
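Before running the code, it can help to compare the installed versions against the ones listed above. The following is a minimal sketch using only the standard library (Python 3.8 or newer); the function names are hypothetical, and mapping sklearn and umap to the distribution names scikit-learn and umap-learn is an assumption about how they were installed.

```python
from importlib import metadata

# Versions listed above. The distribution names are an assumption:
# sklearn is distributed as scikit-learn and umap as umap-learn.
PINNED = {
    "pandas": "0.23.4",
    "numpy": "1.14.2",
    "scikit-learn": "0.20.0",
    "umap-learn": "0.3.5",
    "seaborn": "0.9.0",
    "matplotlib": "3.0.0",
}


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def version_report(pinned):
    """Return one line per library comparing the installed and listed versions."""
    lines = []
    for name, installed in installed_versions(pinned).items():
        status = "ok" if installed == pinned[name] else "differs"
        lines.append(f"{name}: installed={installed}, listed={pinned[name]} ({status})")
    return lines
```

Calling print("\n".join(version_report(PINNED))) lists every library together with whether it matches the version used here. A different version is not necessarily a problem; the versions above are simply the ones the code was run with.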
- Clone this repo
- Run the following command:
python main.py --plot_dir ../plots --crds_dir ../reduced_dim_coordinates --genedata_src ../data/data_sample.csv --label_src ../data/affy_samples.20141118.panel --full_pop_name ../data/20131219.populations.tsv --logfile log.txt
main.py takes the following arguments:
- --logfile: file path of the log file; if not given, the default is stdout
- --plot_dir: directory path for the plots (mandatory)
- --crds_dir: directory path for the reduced-dimensional coordinates (mandatory)
- --genedata_src: path to the gene data .csv file (mandatory)
- --label_src: path to the sample-label .txt file (mandatory)
- --full_pop_name: path to the name-verbose name .txt file (mandatory)
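
The arguments above could be declared with argparse roughly as follows. This is a minimal sketch rather than the actual contents of main.py: the flag names come from the list above, while build_parser, open_log and the help texts are hypothetical.

```python
import argparse
import sys


def build_parser():
    """Build a parser for the documented flags (a sketch, not the repository's code)."""
    parser = argparse.ArgumentParser(description="Apply PCA and UMAP to gene data.")
    # Optional: when omitted, the log goes to stdout
    parser.add_argument("--logfile", default=None,
                        help="file path of the log file (default: stdout)")
    # The remaining flags are mandatory
    parser.add_argument("--plot_dir", required=True,
                        help="directory path for the plots")
    parser.add_argument("--crds_dir", required=True,
                        help="directory path for the reduced-dimensional coordinates")
    parser.add_argument("--genedata_src", required=True,
                        help="path to the gene data .csv file")
    parser.add_argument("--label_src", required=True,
                        help="path to the sample-label file")
    parser.add_argument("--full_pop_name", required=True,
                        help="path to the name-verbose name file")
    return parser


def open_log(logfile):
    """Return a writable stream: the named file, or stdout when no path is given."""
    return open(logfile, "a") if logfile else sys.stdout
```

With required=True, argparse exits with a usage message when a mandatory flag is missing, so the driver code does not need its own checks for that.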