PACKAGE DOCUMENTATION

package ml
    import "."

    Package ml provides implementations of useful machine learning
    algorithms for data mining and data analysis.

    The implemented algorithms are:

	- Linear Regression
	- Logistic Regression
	- Neural Networks

    The Fmincg function is also implemented, in order to calculate the
    optimal theta configuration that reduces the cost value for all of
    the implemented solutions.

    Author: Alonso Vidales <alonso.vidales@tras2.es>

    Use of this source code is governed by a BSD-style license. These
    programs and documents are distributed without any warranty, express
    or implied. All use of these programs is entirely at the user's own
    risk.

FUNCTIONS

func Fmincg(nn DataSet, lambda float64, length int, verbose bool) (fx []float64, i int, err error)
    Minimizes a continuous differentiable multivariate function. The
    starting point is given by the "Lambda" property (D by 1), and the
    method named "CostFunction" must return a function value and a
    vector of partial derivatives. The Polak-Ribiere flavour of
    conjugate gradients is used to compute search directions, and a line
    search using quadratic and cubic polynomial approximations and the
    Wolfe-Powell stopping criteria is used together with the slope ratio
    method for guessing initial step sizes. Additionally, a number of
    checks are made to ensure that exploration is taking place and that
    extrapolation will not become unboundedly large.

    The "length" parameter gives the length of the run: if positive, it
    gives the maximum number of line searches; if negative, its absolute
    value gives the maximum allowed number of function evaluations. The
    function returns when either its length is up, or when no further
    progress can be made (i.e., we are at a minimum, or so close that,
    due to numerical problems, we cannot get any closer). If the
    function terminates within a few iterations, it could be an
    indication that the function value and derivatives are not
    consistent (i.e., there may be a bug in the implementation of your
    "CostFunction" method).

    The function returns "fx", indicating the progress made, and "i",
    the number of iterations (line searches or function evaluations,
    depending on the sign of "length") used.

    Copyright (C) 2001 and 2002 by Carl Edward Rasmussen. Date 2002-02-13
    Ported from Octave to Go by Alonso Vidales <alonso.vidales@tras2.es>

    (C) Copyright 1999, 2000 & 2001, Carl Edward Rasmussen

    Permission is granted for anyone to copy, use, or modify these
    programs and accompanying documents for purposes of research or
    education, provided this copyright notice is retained, and note is
    made of any changes that have been made. These programs and
    documents are distributed without any warranty, express or implied.
    As the programs were written for research purposes only, they have
    not been tested to the degree that would be advisable in any
    important application. All use of these programs is entirely at the
    user's own risk.
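    The following is a minimal, illustrative sketch of how Fmincg can be
    driven through one of the types documented below (Regression, which
    is assumed here to satisfy the DataSet interface via its
    CostFunction method). The import path and the training-file name are
    assumptions for illustration, not values fixed by this package:

	package main

	import (
	    "fmt"

	    "ml" // hypothetical import path for this package
	)

	func main() {
	    // LoadFile expects rows of "X1 X2 ... XN Y" values separated
	    // by single spaces; the file name is hypothetical.
	    data := ml.LoadFile("test_data/training.txt")
	    data.LinearReg = true // treat this as a linear regression problem
	    data.InitializeTheta()

	    // Run at most 100 line searches, with lambda = 1.0.
	    fx, iters, err := ml.Fmincg(data, 1.0, 100, false)
	    if err != nil {
	        fmt.Println("minimization failed:", err)
	        return
	    }
	    fmt.Printf("iterations: %d, cost progress: %v\n", iters, fx)
	}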
func MapFeatures(x [][]float64, degree int) (ret [][]float64)
    Calculates all the possible combinations of the features and returns
    them with the specified degree. For example, for a data.X with
    features x1, x2 and degree 2, data.X is converted to:

	1, x1, x2, x1 * x2, x1 ** 2, x2 ** 2, (x1 * x2) ** 2

    Use this method with care when looking for the model that best fits
    the problem.

func Normalize(values []float64) (norm []float64, valid bool)
    Returns all the values of the given slice normalized; the formula
    applied to each element is:

	(Xn - avg) / (max - min)

    If all the elements in the slice have the same value, or the slice
    is empty, the slice can't be normalized, and false is returned in
    the "valid" parameter.

func PrepareX(x [][]float64, degree int) (newX [][]float64)
    Returns the x matrix with all the elements raised to the powers 1,
    2, ..., degree, and adds a 1 at the beginning of each row to be used
    as the bias value.

    For example, for a given matrix like:

	3 4
	5 8

    prepared with degree 2, the result is:

	1 3 9 4 16
	1 5 25 8 64

TYPES

type DataSet interface {
    // Returns the cost and gradients for the current thetas configuration
    CostFunction(lambda float64, calcGrad bool) (j float64, grad [][][]float64, err error)
    // contains filtered or unexported methods
}
    Interface to be implemented by the machine learning algorithms in
    order to be used by the Fmincg function to reduce the cost.

type NeuralNet struct {
    // Training set of values for each feature; the first dimension
    // contains the test cases
    X [][]float64
    // The training set with the values to be predicted
    Y [][]float64
    // 1st dim -> layer, 2nd dim -> neuron, 3rd dim -> theta
    Theta [][][]float64
}
    Neural network representation; the X and Y properties are to be used
    for training purposes.

func NewNeuralNetFromCsv(xSrc string, ySrc string, thetaSrc []string) (result *NeuralNet)
    Loads the information contained in the specified file paths and
    returns a NeuralNet instance. Each input file should contain one row
    per sample, with the values separated by a single space. To load the
    thetas, specify in thetaSrc the file paths that contain each of the
    layer values; the order of these paths represents the order of the
    layers. If only the theta parameters need to be loaded, specify an
    empty string for the xSrc and ySrc parameters.

func (nn *NeuralNet) CostFunction(lambda float64, calcGrad bool) (j float64, grad [][][]float64, err error)
    Calculates the cost function for the training set stored in the X
    and Y properties of the instance, with the current theta
    configuration. The lambda parameter controls the degree of
    regularization (0 means no regularization; infinity means ignoring
    all input variables, because all their coefficients will be zero).
    When calcGrad is true, the gradient is calculated in addition to the
    cost; when false, only the cost is calculated.

func (nn *NeuralNet) GetPerformance(verbose bool) (cost float64, performance float64)
    Returns the performance of the neural network on the current set of
    samples. The performance is calculated as:

	matches / total_samples

func (nn *NeuralNet) Hipotesis(x []float64) (result []float64)
    Returns the hypothesis calculated for the sample "x" using the
    thetas of nn.Theta.

func (nn *NeuralNet) InitializeThetas(layerSizes []int)
    Randomly initializes the thetas in order to break symmetry. Each
    element of the slice "layerSizes" contains the size of the layer to
    be initialized; the first layer is the input layer, and the last
    layer corresponds to the output layer.

func (nn *NeuralNet) MinimizeCost(maxIters int, suffleData bool, verbose bool) (finalCost float64, performance float64, trainingData *NeuralNet, testData *NeuralNet)
    This method splits the samples contained in the NeuralNet instance
    into three sets (60%, 20%, 20%): training, cross validation, and
    test. In order to calculate the optimal theta, it tries different
    lambda values on the training set and compares the performance
    obtained on the cross validation set, choosing the lambda with the
    best cross validation performance. After calculating the best
    lambda, it merges the training and cross validation sets and trains
    the neural network with 80% of the samples. The data can be shuffled
    in order to obtain a better distribution before dividing it into
    groups.

func (nn *NeuralNet) SaveThetas(targetDir string) (files []string)
    Stores all the current theta values of the instance in the
    "targetDir" directory. This method creates one file per theta layer,
    called theta_X.txt, where X is the layer contained in the file.
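    The following sketch ties the NeuralNet workflow together: loading
    the samples, initializing the thetas, training, and saving the
    result. The import path, file names, and layer sizes are assumptions
    for illustration only:

	package main

	import (
	    "fmt"

	    "ml" // hypothetical import path for this package
	)

	func main() {
	    // One row per sample, values separated by single spaces; the
	    // file names are hypothetical. Passing nil as thetaSrc assumes
	    // no pre-trained thetas need to be loaded.
	    nn := ml.NewNeuralNetFromCsv("test_data/x.txt", "test_data/y.txt", nil)

	    // Random initialization for a network with 400 inputs, one
	    // hidden layer of 25 neurons, and 10 outputs (hypothetical sizes).
	    nn.InitializeThetas([]int{400, 25, 10})

	    // Split the samples, pick the best lambda on the cross
	    // validation set, and train on 80% of the data.
	    cost, perf, _, _ := nn.MinimizeCost(100, true, false)
	    fmt.Printf("final cost: %f, performance: %f\n", cost, perf)

	    // Persist one theta_X.txt file per layer in the target
	    // directory (hypothetical path).
	    nn.SaveThetas("trained_thetas")
	}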
type Regression struct {
    X [][]float64 // Training set of values for each feature; the first dimension contains the test cases
    Y []float64   // The training set with the values to be predicted

    // Theta parameters of the hypothesis
    Theta []float64

    LinearReg bool // true indicates a linear regression problem; false, a logistic regression one
}
    Linear and logistic regression structure.

func LoadFile(filePath string) (data *Regression)
    Loads the information from the local file located at filePath and,
    after parsing it, returns a Regression instance ready to be used,
    with all the information loaded. The file format is:

	X11 X12 ... X1N Y1
	X21 X22 ... X2N Y2
	... ... ... ... ..
	XN1 XN2 ... XNN YN

    Note: use a single space as the separator.

func (lr *Regression) CostFunction(lambda float64, calcGrad bool) (j float64, grad [][][]float64, err error)
    Calculates the cost function for the training set stored in the X
    and Y properties of the instance, with the current theta
    configuration. The lambda parameter controls the degree of
    regularization (0 means no regularization; infinity means ignoring
    all input variables, because all their coefficients will be zero).
    When calcGrad is true, the gradient is calculated in addition to the
    cost; when false, only the cost is calculated.

func (lr *Regression) InitializeTheta()
    Initializes the Theta property to an array of zeros with a length
    equal to the number of features in the X property.

func (data *Regression) LinearHipotesis(x []float64) (r float64)

func (data *Regression) LogisticHipotesis(x []float64) (r float64)
    Returns the hypothesis result for the thetas in the instance and the
    specified parameters.

func (data *Regression) MinimizeCost(maxIters int, suffleData bool, verbose bool) (finalCost float64, trainingData *Regression, lambda float64, testData *Regression)
    This method splits the given data into three sets: training, cross
    validation, and test. In order to calculate the optimal theta, it
    tries different lambda values on the training data and selects the
    one that best matches the cross validation set; after obtaining the
    best lambda, it checks the performance against the test set.

SUBDIRECTORIES

	test_data
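    Finally, a minimal end-to-end sketch of the regression workflow
    documented above. The import path, file name, and feature values are
    assumptions for illustration only:

	package main

	import (
	    "fmt"

	    "ml" // hypothetical import path for this package
	)

	func main() {
	    // The file name is hypothetical; rows follow the documented
	    // "X11 X12 ... X1N Y1" format.
	    data := ml.LoadFile("test_data/samples.txt")
	    data.LinearReg = false // treat this as a logistic regression problem
	    data.InitializeTheta()

	    // Search for the best lambda against the cross validation
	    // split, then verify against the test split.
	    cost, _, lambda, _ := data.MinimizeCost(100, true, false)
	    fmt.Printf("final cost: %f with lambda: %f\n", cost, lambda)

	    // Predict for a new sample; the feature values are
	    // hypothetical and must match the training dimensions.
	    fmt.Println(data.LogisticHipotesis([]float64{0.5, -0.3}))
	}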