Pronounced View SETI, this Python web server/client application visualises clustered data generated by machine learning algorithms in a separate script. It can help with data exploration and serve as a pre-processing tool for feature selection and dimensionality reduction prior to training a classifier.
Kyle Harrison: kyleharrison1994@gmail.com
Dr Amit Mishra: akmishra@ieee.org
The rendered Bootstrap HTML page dynamically generates content based on input data and uses Vis.JS for a 3D cluster visualisation scatter plot and Canvas.JS for 1D raw data visualisation. The server and client communicate in JSON via jQuery AJAX, and all event handlers are implemented with jQuery.
The Data_Analyis_Script.py uses Scikit-Learn for machine learning dimensionality reduction, feature selection and data clustering. The script uses component analysis techniques on unlabelled data stored in HDF5 files and clustering techniques to apply labels to the computed components. These algorithms are fundamental to unsupervised machine learning, using correlation and variance to derive class membership in order to pre-process datasets and make the process of training classifiers easier for other projects.
pip install -r requirements.txt
- Python 2.7
- NumPy
- SciPy
- Flask
- Scikit-Learn
This project formed the basis of my Computer and Electrical undergraduate thesis and was completed in a matter of weeks, so there is much that can be added for improvement. If you would like to use this tool, please contact me or post an issue. I have tried my best to document the tool in this readme; more can be found in my report.
- HTML - the base design needs to be improved, it was built for functionality - not looks
- Server - Flask is a micro-framework and could be improved with a Django implementation
- Database - HDF5 files are great for storage and easy to use but the ideal system would have SQL database interaction
- Machine Learning - more analysis techniques can be added, better parameter estimation techniques are available
The primary use case for this tool is analysis of Radio Frequency Interference data from time-domain transient and frequency-domain spectral data. By convention, astronomical data is stored in HDF5 files.
The machine learning algorithms applied to raw data are:
- KPCA - Kernel Principal Component Analysis
- PCA - Principal Component Analysis
- DBSCAN - Density-based spatial clustering of applications with noise
KPCA and PCA use the variance within unlabelled data to generate clusters based on similarities and differences between samples. The components generated are new features for the data that represent the greatest degrees of variance in the original features and in linear combinations of those features.
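As a rough illustration of how components capture the greatest variance, here is a minimal PCA sketch written directly in NumPy. It is not taken from the analysis script, which uses Scikit-Learn; the function name `pca_components` and the synthetic `samples` array are assumptions made for this example only.

```python
import numpy as np


def pca_components(data, n_components=3):
    """Project the rows of `data` onto the directions of greatest variance."""
    data = np.asarray(data, dtype=float)
    centred = data - data.mean(axis=0)  # PCA operates on mean-centred features
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    explained_variance = singular_values ** 2 / (len(data) - 1)
    components = np.dot(centred, vt[:n_components].T)
    return components, explained_variance[:n_components]


# Hypothetical data: 200 samples, 5 features with very different spreads.
rng = np.random.RandomState(0)
samples = rng.normal(size=(200, 5)) * np.array([10.0, 3.0, 1.0, 0.5, 0.1])

components, variance = pca_components(samples, n_components=3)
print(components.shape)  # one row per sample, one column per component
print(variance)          # variance captured by each component, largest first
```

Three components are kept here only because the web page plots clusters in 3D; any number up to the feature count would work in this sketch.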
DBSCAN is a density-based clustering tool that applies labels to data in densely clustered space and labels points in areas of low density as outliers.
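The behaviour described above can be seen on a small synthetic example using Scikit-Learn's `DBSCAN`. The data, `eps` and `min_samples` values below are assumptions chosen for illustration and are not the settings used by the analysis script.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two tight groups of points plus one isolated point.
rng = np.random.RandomState(0)
dense_a = rng.normal(loc=0.0, scale=0.1, size=(30, 2))
dense_b = rng.normal(loc=5.0, scale=0.1, size=(30, 2))
outlier = np.array([[2.5, 10.0]])
points = np.vstack([dense_a, dense_b, outlier])

# eps is the neighbourhood radius; min_samples is the number of neighbours
# a point needs within that radius to count as part of a dense region.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)

# Scikit-Learn numbers clusters from 0 and marks outliers (noise) with -1.
print(sorted(set(labels)))
```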
KPCA's gamma is estimated through an iterative brute-force search. DBSCAN's eps is estimated from the standard deviation of Euclidean pairwise distances.
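A minimal sketch of the eps estimate, using only the standard library, might look like the following. The function name `estimate_eps` is hypothetical, and whether the script uses the population or sample standard deviation, or scales the result further, is not stated here; the population form is assumed.

```python
import itertools
import math


def estimate_eps(points):
    """Standard deviation of all pairwise Euclidean distances between points."""
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        for p, q in itertools.combinations(points, 2)
    ]
    mean = sum(distances) / len(distances)
    # Population standard deviation (an assumption, see the note above).
    return math.sqrt(sum((d - mean) ** 2 for d in distances) / len(distances))


# Three collinear points with pairwise distances 5, 10 and 5.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
eps = estimate_eps(points)
print(eps)
```

Computing every pairwise distance is quadratic in the number of samples, so on large datasets it may be worth estimating from a random subset instead.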
Metrics for clustering used in parameter estimation are:
- SI - Thornton's Separability Index
- Silhouette Score
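Thornton's Separability Index is commonly defined as the fraction of samples whose nearest neighbour carries the same label. The sketch below follows that definition in plain Python; the function name `separability_index` and the toy data are assumptions for illustration, and the script's own implementation may differ.

```python
import math


def separability_index(points, labels):
    """Fraction of points whose nearest neighbour has the same label."""
    matches = 0
    for i, p in enumerate(points):
        # Index of the closest other point by Euclidean distance.
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.sqrt(
                sum((a - b) ** 2 for a, b in zip(p, points[j]))
            ),
        )
        if labels[nearest] == labels[i]:
            matches += 1
    return matches / float(len(points))


# Two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
clean = separability_index(points, [0, 0, 1, 1])   # labels agree with geometry
mixed = separability_index(points, [0, 1, 0, 1])   # labels cut across pairs
print(clean, mixed)
```

A value near 1 suggests the labels line up with the spatial structure of the components, while a value near 0 suggests they do not. The Silhouette Score is available in Scikit-Learn as `sklearn.metrics.silhouette_score`.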
Analysis data is stored in H5 files, with a group for each analysis type and, within each group, datasets for the resulting components and the labels from DBSCAN. The raw data used for analysis is recorded in a separate dataset. This format is shown below:
The analysis script is run separately from the server in order to prevent latency issues on slower systems. The server loads the resulting H5 files requested by the client into memory and returns components, labels and raw data to the web page for visualisation.
A more detailed block diagram is shown below:
The methods in Server.py, which use Flask to render HTML pages and H5Py to interact with HDF5 files, are shown below:
The JavaScript interactions in Webpage.html are shown below: