This repository is a collection of command-line tools for data analysis and visualization. These tools form a thin interface around several commonly used Python packages for data science and machine learning.
The tools depend on several Python packages, all of which can be installed in an Anaconda environment:
conda install matplotlib numpy pandas scikit-learn seaborn tensorflow-gpu==1.12.0
There are four primary scripts:
classify.py: classification algorithms
cluster.py: clustering algorithms
regress.py: regression algorithms
visualize.py: data visualization
These scripts are found in the bin folder. Each script takes two inputs: (1) a tab-delimited data matrix and (2) a JSON configuration file. The data matrix is read as a pandas DataFrame; it should contain row-wise samples and should include both features and outputs. The JSON config file should specify numerical features, categorical features, and outputs. The following example could be for a dataset of housing prices:
{
  "numerical": [
    "age",
    "area"
  ],
  "categorical": [
    "state",
    "zip",
    "color",
    "foreclosed"
  ],
  "output": [
    "price"
  ]
}
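To illustrate how the two inputs fit together, here is a minimal sketch of loading a tab-delimited matrix and a config of the shape shown above. It is not the code used by the scripts in this repository: the function name `load_dataset`, the toy data, and the choice to one-hot encode categorical columns are all assumptions made for the example.

```python
import io
import json

import pandas as pd


def load_dataset(data_file, config_file):
    """Hypothetical helper: read the data matrix and config, then split features from outputs."""
    # Rows are samples; columns hold both features and outputs.
    df = pd.read_csv(data_file, sep="\t")
    config = json.load(config_file)

    feature_cols = config["numerical"] + config["categorical"]
    output_cols = config["output"]

    # Fail early if the config names a column that the data does not contain.
    missing = [c for c in feature_cols + output_cols if c not in df.columns]
    if missing:
        raise KeyError(f"columns named in config but absent from data: {missing}")

    # Assumption: categorical features are one-hot encoded; numerical ones pass through.
    X = pd.get_dummies(df[feature_cols], columns=config["categorical"])
    y = df[output_cols]
    return X, y


# Toy stand-ins for the two input files (invented values, for illustration only).
data = io.StringIO(
    "age\tarea\tstate\tprice\n"
    "10\t1500\tSC\t200000\n"
    "25\t900\tNC\t120000\n"
)
config = io.StringIO(
    '{"numerical": ["age", "area"], "categorical": ["state"], "output": ["price"]}'
)

X, y = load_dataset(data, config)
print(list(X.columns))
print(list(y.columns))
```

Keeping the column roles in the config rather than in the data file means the same matrix can be reused with different feature sets or outputs by swapping only the JSON file.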
The scripts folder contains a collection of helper scripts for performing odd tasks. The create-config.py script can generate a basic config file from any tab-delimited data file, but you will likely need to modify it to suit your particular needs.
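One plausible way to generate a basic config from a data file is to sort columns by their inferred dtype. The sketch below shows that idea only; it is not the logic of create-config.py, and the function name `infer_config`, its `output_cols` parameter, and the toy data are hypothetical.

```python
import io
import json

import pandas as pd
from pandas.api.types import is_numeric_dtype


def infer_config(df, output_cols=()):
    """Hypothetical helper: guess column roles from pandas dtypes."""
    config = {"numerical": [], "categorical": [], "output": list(output_cols)}
    for col in df.columns:
        if col in config["output"]:
            continue
        # Assumption: numeric dtype means numerical feature, anything else is categorical.
        key = "numerical" if is_numeric_dtype(df[col]) else "categorical"
        config[key].append(col)
    return config


# Toy tab-delimited data (invented values, for illustration only).
data = io.StringIO(
    "age\tstate\tzip\tprice\n"
    "10\tSC\t29631\t200000\n"
    "25\tNC\t27601\t120000\n"
)
df = pd.read_csv(data, sep="\t")

config = infer_config(df, output_cols=["price"])
print(json.dumps(config, indent=2))
```

In this toy run, `zip` is parsed as an integer and therefore lands under "numerical", even though a ZIP code is better treated as categorical. That kind of misclassification is one reason a generated config usually needs manual edits.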