Machine Learning: PCA, Random Forest, Convex Hulls, etc.

PyML

A set of machine learning tools that uses PCA, agglomerative clustering, and convex hulls to classify galaxies.

Usage:

If catalog is a dictionary containing the following parameters: ['C','M20','GINI','ASYM','MPRIME','I','D'], use these steps to project the morphological data onto predefined PC eigenvectors and to classify galaxies into the groups defined in Peth et al. 2016:

from PyML import machinelearning as pyml
from PyML import convexhullclassifier as cvx

# Statistics MUST be in this order
parameters = ['C_J', 'M20_J', 'GINI_J', 'ASYM_J', 'MPRIME_J', 'I_J', 'D_J']
npmorph = pyml.dataMatrix(catalog, parameters)
pc = pyml.pcV(npmorph)  # Principal components
groups = cvx.convexHullClass(pc.X)  # Groups from the convex hull classifier
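
The two steps above (projecting onto predefined eigenvectors, then assigning groups by convex hull membership) can be illustrated with a minimal NumPy sketch. It does not use PyML and is not its implementation: the reference mean, standard deviation, eigenvectors, and the square group boundary are hypothetical placeholders, and the helper names project and in_convex_polygon are invented for illustration. The hull test is shown in two dimensions only.

```python
import numpy as np

# Hypothetical inputs: 5 galaxies x 7 morphological statistics (random placeholders).
rng = np.random.default_rng(0)
morph = rng.normal(size=(5, 7))

# Assumed "predefined" quantities; real values would come from a reference sample.
ref_mean = np.zeros(7)
ref_std = np.ones(7)
# Orthonormal columns used as stand-in eigenvectors.
eigvecs = np.linalg.qr(rng.normal(size=(7, 7)))[0]


def project(data, mean, std, vectors):
    """Standardize each column, then project onto the eigenvector columns."""
    return ((data - mean) / std) @ vectors


def in_convex_polygon(point, vertices):
    """True if a 2D point lies inside a convex polygon with counterclockwise vertices."""
    v = np.asarray(vertices, dtype=float)
    edges = np.roll(v, -1, axis=0) - v           # edge i runs from vertex i to vertex i+1
    to_point = np.asarray(point, dtype=float) - v
    # z-component of the 2D cross product; non-negative means "left of or on the edge".
    cross = edges[:, 0] * to_point[:, 1] - edges[:, 1] * to_point[:, 0]
    return bool(np.all(cross >= 0))


pcs = project(morph, ref_mean, ref_std, eigvecs)

# Hypothetical group boundary in the PC1-PC2 plane (a square, counterclockwise).
hull = [(-1, -1), (1, -1), (1, 1), (-1, 1)]
in_group = [in_convex_polygon(p[:2], hull) for p in pcs]
```

Using fixed reference statistics and eigenvectors, rather than refitting PCA on each new catalog, keeps the principal component axes comparable across data sets, which is what allows a fixed set of group boundaries to be reused.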

For Random Forest Classifications, usage:

from PyML import machinelearning as pyml
import pandas as pd

cols = ['PC1','PC2','PC3','PC4','PC5','PC6','PC7',
        'g','m20','mprime','i','d','a','c','gr_col','logMass','ssfr','f_gm20','d_gm20']

#Use pandas to read in data as a dataframe (df); uncomment the line that matches your file format
#df = pd.read_csv('DataFile.txt')
#df = pd.read_pickle('ps1_morph_spec_gz_pc.pkl')

result, labels, label_probability = pyml.randomForestMC(df, iterations=1000)
#result = summary statistics, feature importances (N iterations x N statistics/importances)
#labels = labels following random forest (N galaxies x N iterations)
#label_probability = probability of label following random forest (N galaxies x N iterations)
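
Because labels has one column per iteration, a common next step is to reduce it to a single label per galaxy. The sketch below is one possible approach, not part of PyML: it takes a majority vote across iterations and reports how often the winning label occurred. The array is a small hypothetical stand-in, labels are assumed to be non-negative integers, and the helper name consensus is invented for illustration.

```python
import numpy as np

# Hypothetical stand-in for `labels`: 4 galaxies x 5 iterations of integer group labels.
labels = np.array([
    [0, 0, 1, 0, 0],
    [2, 2, 2, 2, 2],
    [1, 0, 1, 1, 0],
    [0, 2, 2, 1, 2],
])


def consensus(label_matrix):
    """Return the most frequent label per row and the fraction of iterations that agree."""
    n_classes = label_matrix.max() + 1
    # counts[i, k] = number of iterations in which galaxy i received label k
    counts = np.stack([np.bincount(row, minlength=n_classes) for row in label_matrix])
    winner = counts.argmax(axis=1)
    agreement = counts.max(axis=1) / label_matrix.shape[1]
    return winner, agreement


winner, agreement = consensus(labels)
print(winner)     # [0 2 1 2]
print(agreement)  # [0.8 1.  0.6 0.6]
```

A low agreement fraction flags galaxies whose classification is unstable across iterations and may deserve a closer look.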