Machine learning algorithms (implemented in the C language)
See the figures in README.pdf.
The algorithm prototypes are based on "Machine Learning Algorithms From Scratch" by Jason Brownlee.
Language: C (ported from the book's Python)
Software: Code::Blocks
Algorithms: SLR, MLR, KNN, Naive Bayes
Ⅰ. Simple linear regression algorithm
Dataset: Swedish Auto Insurance Dataset (https://goo.gl/qPSNqY)
1、Algorithm description (from book)
In simple linear regression we can use statistics on the training data to estimate the coefficients required by the model to make predictions on new data. The line for a simple linear regression model can be written as y = b0 + b1 * x, where b1 = covariance(x, y) / variance(x) and b0 = mean(y) - b1 * mean(x).
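As a quick sanity check of these formulas, here is a minimal standalone sketch over a tiny contrived dataset (x = {1, 2, 4, 3, 5}, y = {1, 3, 3, 2, 5}, as in the book's worked example; this is not part of the project code):

/* Toy check of the coefficient formulas above */
#include <stdio.h>

int main(void)
{
    double x[] = {1, 2, 4, 3, 5};
    double y[] = {1, 3, 3, 2, 5};
    int n = 5, i;
    double mx = 0, my = 0, covar = 0, varx = 0, b0, b1;

    for (i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;
    for (i = 0; i < n; i++) {
        covar += (x[i] - mx) * (y[i] - my); // sum of cross-deviations
        varx  += (x[i] - mx) * (x[i] - mx); // sum of squared deviations
    }
    b1 = covar / varx;    // 8.0 / 10.0 = 0.8
    b0 = my - b1 * mx;    // 2.8 - 0.8 * 3 = 0.4
    printf("b0=%.1f b1=%.1f\n", b0, b1);
    return 0;
}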
2、Function implementation
- Create a two-dimensional array
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h> //random number
typedef float type;
#define split 0.6
// Create a two-dimensional array
type **createarray(int n,int m)
{
int i;
type **array;
array=(type **)malloc(n*sizeof(type *));
array[0]=(type *)malloc(n*m*sizeof(type));
for(i=1;i<n;i++) array[i]=array[i-1]+m;
return array;
}
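Since createarray allocates the row-pointer block and the data block separately, freeing it takes exactly two calls. A matching helper (freearray is a name introduced here, not in the original code) might look like this; it assumes array[0] still points at the original data block, i.e. the row pointers have not been reassigned since allocation:

// Free a matrix allocated by createarray: one data block, one pointer block
void freearray(type **array)
{
    free(array[0]); // the contiguous n*m data block
    free(array);    // the row-pointer block
}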
- Mean function

// Mean of the first column (x) of the dataset
double mean_x(type **array, int N)
{
    int i;
    double sum = 0;
    for (i = 0; i < N; i++) {
        sum += array[i][0];
    }
    return sum / N;
}
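The coefficient code further down also calls mean_y, which is never listed in this document; presumably it mirrors mean_x over the second (y) column. A sketch under that assumption:

// Presumed mean_y: mean of the second column (y) of the dataset
double mean_y(type **array, int N)
{
    int i;
    double sum = 0;
    for (i = 0; i < N; i++) {
        sum += array[i][1];
    }
    return sum / N;
}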
- Variance function

// Sum of squared deviations of x from its mean m (the 1/N factor cancels in b1 = covar/varx)
double variance_x(double m, type **array, int N) // takes the mean and the training data
{
    int i;
    double varx = 0.0;
    for (i = 0; i < N; i++) {
        varx += pow(array[i][0] - m, 2);
    }
    return varx;
}
- Covariance function

// Sum of cross-deviations of x and y (a and b are the means of x and y)
double covariance(double a, double b, type **array, int N)
{
    int i;
    double covar = 0.0;
    for (i = 0; i < N; i++) {
        covar += (array[i][0] - a) * (array[i][1] - b);
    }
    return covar;
}
- Importing datasets

The dataset file is a TXT file with N rows and D columns; the values in each row are separated by spaces, and the first line carries the dimensions.

void loadfile(int *n, int *d, type ***array) // dataset with n rows and d columns
{
    int i, j;
    FILE *fp;
    if ((fp = fopen("SLR.txt", "r")) == NULL) {
        fprintf(stderr, "can not open SLR.txt!\n");
    }
    if (fscanf(fp, "N=%d,D=%d", n, d) != 2) {
        fprintf(stderr, "reading error!\n");
    }
    *array = createarray(*n, *d); // create an n x d matrix
    for (i = 0; i < *n; i++)
        for (j = 0; j < *d; j++)
            fscanf(fp, "%f", &(*array)[i][j]); // read one value
    if (fclose(fp)) {
        fprintf(stderr, "can not close SLR.txt");
    }
}
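Given the fscanf pattern above, the file must begin with a dimensions line such as N=5,D=2. An illustrative SLR.txt (the values below are placeholders, not necessarily rows of the actual dataset) could look like:

N=5,D=2
108 392.5
19 46.2
13 15.7
124 422.2
40 119.4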
3、Part of the main function
- The dataset is split into a training set and a test set according to the split ratio (a more uniform alternative for the random draw is sketched after this block)

int i, j;
int D, N, size1, size2;
type **test = NULL;
type **Train = NULL; // the full dataset, N rows and D columns
type **train = NULL;
loadfile(&N, &D, &Train);
size2 = N * split; // training-set size
size1 = N - size2; // test-set size
test = createarray(size1, D);
train = createarray(size2, D);
srand(2); // fixed seed so the split is reproducible
int t = 0;
while (t < size1)
{
    int i = rand() % 11 * (N - t - 1) / 10; // coarse random row index in [0, N-t-1]
    for (j = 0; j < D; j++) {
        test[t][j] = Train[i][j]; // copy the ith row into the test set
    }
    for (j = i; j < (N - t - 1); j++) // delete the ith row: the rows below it shift up by one
    {
        Train[j] = Train[j + 1];
    }
    t++;
}
// (the remaining rows of Train presumably become the training set train in a step elided here)
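Note that rand() % 11 * (N - t - 1) / 10 can only land on about eleven evenly stepped positions per draw, so the rows are sampled unevenly. A more uniform draw (a sketch, not the original code) would be:

// Uniformly pick one of the N - t rows still left in Train
int i = rand() % (N - t);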
- Calculate the regression coefficients b0 and b1

type *predict;
double rmse;
double meanx, meany, covar, varx;
double b0, b1;
predict = (type *)malloc(size1 * sizeof(type));
meanx = mean_x(train, size2);
meany = mean_y(train, size2);
varx = variance_x(meanx, train, size2);
covar = covariance(meanx, meany, train, size2);
b1 = covar / varx;       // slope
b0 = meany - b1 * meanx; // intercept
- Calculate the RMSE on the test set

for (i = 0; i < size1; i++) {
    predict[i] = b0 + test[i][0] * b1; // predicted y for each test row
}
// calculate the RMSE
type sumerror = 0.0;
for (j = 0; j < size1; j++) {
    sumerror += pow(predict[j] - test[j][1], 2);
}
rmse = sqrt(sumerror / size1);
printf("RMSE: %f \n", rmse);
4、Results and evaluation
- All functions of the algorithm are implemented, and the program produces simple linear regression predictions on the test set.
Ⅱ. Multivariate Linear Regression
1、Algorithm description (from book):
Dataset: Wine Quality Dataset https://goo.gl/8jsvFK
Linear regression is a technique for predicting a real value. Confusingly, these problems where a real value is to be predicted are called regression problems. Linear regression is a technique where a straight line is used to model the relationship between input and output values. In more than two dimensions, this straight line may be thought of as a plane or hyperplane. Predictions are made as a combination of the input values to predict the output value. Each input attribute (x) is weighted using a coefficient (b), and the goal of the learning algorithm is to discover a set of coefficients that results in good predictions (y).
Coefficients can be found using stochastic gradient descent: after each prediction, every coefficient is nudged against the error, b = b - l_rate * error * x (with x = 1 for the intercept b0), as the single-step sketch below illustrates.
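As a concrete single step (toy numbers, not from the dataset): with all coefficients at zero, a row x = 1, y = 1 and l_rate = 0.001, the update works out as follows.

#include <stdio.h>
typedef float type;

int main(void)
{
    // One SGD update on a single toy row; mirrors the inner loop of coefficient_sgd below
    type coef[2] = {0.0f, 0.0f};       // b0, b1 start at zero
    type x = 1.0f, y = 1.0f, l_rate = 0.001f;
    type yhat = coef[0] + coef[1] * x; // prediction = 0.0
    type error = yhat - y;             // error = -1.0
    coef[0] -= l_rate * error;         // b0 -> 0.001
    coef[1] -= l_rate * error * x;     // b1 -> 0.001
    printf("b0=%f b1=%f\n", coef[0], coef[1]);
    return 0;
}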
2、Related functions
The two functions below are the same as the ones already described for the SLR algorithm:
- Import datasets
- Create a dynamic two-dimensional array
- Forecast results

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
typedef float type;
#define nfold 5
#define split 0.2

// Predict y for one row: coefficients[0] is the intercept, the rest weight the inputs
double predict(type *row, type *coefficients, int D) // two one-dimensional arrays of length D
{
    double yhat;
    int i;
    yhat = coefficients[0];
    for (i = 0; i < D - 1; i++) {
        yhat += coefficients[i + 1] * row[i];
    }
    return yhat;
}
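For example (illustrative numbers): a row {2.0, 3.0, y} with coefficients {0.1, 0.5, 0.2} and D = 3 gives yhat = 0.1 + 0.5 * 2.0 + 0.2 * 3.0 = 1.7.

type row[] = {2.0f, 3.0f, 0.0f};     // the last column holds the target and is ignored by predict
type coef[] = {0.1f, 0.5f, 0.2f};    // b0, b1, b2
double yhat = predict(row, coef, 3); // 1.7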
- Refresh the coefficients by iteration

// Train the coefficients with stochastic gradient descent
type *coefficient_sgd(type **array, type l_rate, type n_epoch, int D, int size2) // training set, learning rate, number of epochs
{
    double yhat, error;
    int i, j, k;
    type *coef;
    coef = (type *)malloc(D * sizeof(type));
    for (i = 0; i < D; i++) {
        coef[i] = 0.0; // initialize the coefficient array
    }
    for (i = 0; i < n_epoch; i++) {
        for (j = 0; j < size2; j++) {
            yhat = predict(array[j], coef, D);
            error = yhat - array[j][D - 1]; // the last column holds the expected value
            coef[0] = coef[0] - l_rate * error;
            for (k = 0; k < D - 1; k++) {
                coef[k + 1] = coef[k + 1] - l_rate * error * array[j][k];
            }
        }
    }
    return coef;
}
- Calculate the test-set predictions with the trained coefficients

type *linear_regression_sgd(type **train, type **test, double l_rate, double n_epoch, int D, int size1, int size2)
{
    double yyhat;
    int i;
    type *predictions;
    type *coeff;
    predictions = (type *)malloc(size1 * sizeof(type));
    coeff = coefficient_sgd(train, l_rate, n_epoch, D, size2); // compute the coefficients (coefficient_sgd allocates them, so no separate malloc is needed)
    for (i = 0; i < size1; i++) {
        yyhat = predict(test[i], coeff, D);
        predictions[i] = yyhat;
    }
    free(coeff);
    return predictions;
}
- Calculate the final RMSE

double rmse_metric(type **test, type **train, double l_rate, double n_epoch, int D, int size1, int size2) // trains on train, evaluates on test
{
    double sumerror = 0.0;
    double predictionerror;
    int i;
    type *Array;
    Array = linear_regression_sgd(train, test, l_rate, n_epoch, D, size1, size2); // test-set predictions
    for (i = 0; i < size1; i++) {
        predictionerror = Array[i] - test[i][D - 1]; // error of one prediction
        sumerror += pow(predictionerror, 2);
    }
    free(Array);
    return sqrt(sumerror / size1);
}
3、Part of the main function
- Cross validation

......
loaddata(&N, &D, &Sarray); // load the dataset
// Generate the training and testing sets
size1 = split * N; // size of the testing set (one fold)
size2 = N - size1; // size of the training set
train = createarray(size2, D);
test = createarray(size1, D);
// Divide the dataset into five folds for cross validation.
// For the first fold, the test rows are those whose index is divisible by 5.
for (j = 0; j < size1; j++) {
    test[j] = Sarray[nfold * j];
}
for (i = 0; i < size1; i++) {
    for (k = 0; k < 4; k++) {
        train[i * (nfold - 1) + k] = Sarray[nfold * i + k + 1];
    }
}
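The same pattern generalizes to the other folds. A sketch for a fold index f (my own generalization of the code above, assuming N is divisible by nfold; f and ti are names introduced here):

int f = 1, ti = 0; // fold index in [0, nfold-1]; ti walks the training set
for (i = 0; i < size1; i++) {
    test[i] = Sarray[nfold * i + f]; // every nfold-th row, offset by f
}
for (i = 0; i < N; i++) {
    if (i % nfold != f) train[ti++] = Sarray[i]; // all remaining rows
}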
- Calculate and display the results

result1 = rmse_metric(test, train, l_rate, n_epoch, D, size1, size2);
printf("The first fold rmse: %f \n", result1);
4、Result
- The results are accurate with small error, and the cross-validation results are reasonable. All related functions of the MLR algorithm are fully implemented, and the code stays concise.
Ⅲ. K-Nearest Neighbors
1、Algorithm description (from book)
The k-Nearest Neighbors algorithm or KNN for short is a very simple technique. The entire training dataset is stored. When a prediction is required, the k-most similar records to a new record from the training dataset are then located. From these neighbors, a summarized prediction is made. Similarity between records can be measured many different ways. A problem or data-specific method can be used. Generally, with tabular data, a good starting point is the Euclidean distance. Once the neighbors are discovered, the summary prediction can be made by returning the most common outcome or taking the average. As such, KNN can be used for classification or regression problems. There is no model to speak of other than holding the entire training dataset. Because no work is done until a prediction is required, KNN is often referred to as a lazy learning method.
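The code below handles the classification case (returning the most common neighbor label). For the regression case mentioned above, the summary step would instead average the neighbors' values; a sketch (not part of the original code, using the karray layout set up in the main loop below, where row 1 holds the neighbors' target values):

// Regression variant: average the stored target values of the K nearest neighbours
type knn_regression(type **karray, int k)
{
    int i;
    type sum = 0.0;
    for (i = 0; i < k; i++) {
        sum += karray[1][i];
    }
    return sum / k;
}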
2、Related functions
Dataset: Abalone Dataset https://goo.gl/BDYgh5
- The two-dimensional array function and the dataset-loading function are the same as above
- Calculation of the Euclidean distance
#include<stdio.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
#define K 5 //number of the neighbours
typedef float type;
#define nfolds 5
type computedistance(int n,type *avector,type *bvector)
{
int i;
type dist=0.0;
for(i=0;i<n;i++)
dist+=pow(avector[i]-bvector[i],2); //calculate the distance
return sqrt(dist);
}
- Bubble sort

Reference: https://www.cnblogs.com/LCcnblogs/p/6031218.html

// Sort the K columns of the 2 x K array a by row `choice` (0 or 1),
// swapping both rows together so each distance stays paired with its label
void bublesort(int n, type **a, int choice)
{
    int i, j;
    type k;
    for (j = 0; j < n; j++)
        for (i = 0; i < n - j - 1; i++) {
            if (0 == choice) { // sort by the first row (distances)
                if (a[0][i] > a[0][i + 1]) {
                    k = a[0][i]; a[0][i] = a[0][i + 1]; a[0][i + 1] = k;
                    k = a[1][i]; a[1][i] = a[1][i + 1]; a[1][i + 1] = k;
                }
            }
            else if (1 == choice) { // sort by the second row (labels)
                if (a[1][i] > a[1][i + 1]) {
                    k = a[0][i]; a[0][i] = a[0][i + 1]; a[0][i + 1] = k;
                    k = a[1][i]; a[1][i] = a[1][i + 1]; a[1][i + 1] = k;
                }
            }
        }
}
- Get the most common neighbor label

Reference: https://www.cnblogs.com/LCcnblogs/p/6031218.html

// Given the sorted list of the K neighbours' labels, return the most frequent one
type orderedlist(int n, type *list)
{
    int i, count = 1, maxcount = 1;
    type value = list[0]; // initialized in case every label occurs only once
    for (i = 0; i < (n - 1); i++) {
        if (list[i] != list[i + 1]) {
            if (count > maxcount) {
                maxcount = count;
                value = list[i];
            }
            count = 1; // reset on every label change, not only on a new maximum
        }
        else count++;
    }
    if (count > maxcount) {
        maxcount = count;
        value = list[n - 1];
    }
    return value; // the label with the most occurrences
}
3、Part of the main function
- The core part

for (c = 0; c < NN; c++) {
    if (K > N1) exit(-1); // cannot ask for more neighbours than training rows
    for (i = 0; i < K; i++) {
        karray[0][i] = computedistance(D - 1, Array[c], train[i]); // distance to training row i
        karray[1][i] = train[i][D - 1];                            // the row's real label
    }
    bublesort(K, karray, 0);    // sort the first K candidates by distance
    maxdist = karray[0][K - 1]; // the current K-th smallest distance
    for (i = K; i < N1; i++) {
        dist = computedistance(D - 1, Array[c], train[i]);
        if (dist < maxdist) {
            for (j = 0; j < K; j++) {
                if (dist < karray[0][j]) {
                    for (k = K - 1; k > j; k--) { // shift elements to make room for the new value
                        karray[0][k] = karray[0][k - 1];
                        karray[1][k] = karray[1][k - 1];
                    }
                    karray[0][j] = dist; // insert the new value at position j
                    karray[1][j] = train[i][D - 1];
                    break;
                }
            }
            maxdist = karray[0][K - 1]; // refresh the K-th smallest distance
        }
    }
    bublesort(K, karray, 1); // group equal labels together
    predict[c] = orderedlist(K, karray[1]); // the most common label among the K neighbours
    if (predict[c] == Array[c][D - 1]) {
        scores1 = scores1 + 1; // count correct predictions
    }
}
4、Algorithm evaluation
- The cross-validation results are not entirely stable, but the overall accuracy is high. The implementation is complex in places, and there is still room for simplification.
Ⅳ. Naive Bayes
1、Introduction of the algorithm (from book)
Bayes’ Theorem provides a way that we can calculate the probability of a piece of data belonging to a given class, given our prior knowledge. Bayes’ Theorem is stated as:
P(class|data) = P(data|class) × P(class) / P(data)
Where P(class|data) is the probability of class given the provided data. Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. It is called Naive Bayes or idiot Bayes because the calculations of the probabilities for each class are simplified to make their calculations tractable. Rather than attempting to calculate the probabilities of each attribute value, they are assumed to be conditionally independent given the class value. This is a very strong assumption that is most unlikely in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold.
2、Related functions
- The code for array creation and dataset import is the same as above.
- Mean function

// Mean of a one-dimensional array of the given length
double meanlist(type *array, int size)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < size; i++) {
        sum += array[i];
    }
    return sum / size;
}
- Calculate the standard deviation

// Sample standard deviation (note the size - 1 denominator)
double stdev(type *array, int size) // size is the length of the array
{
    double avg;
    double variance = 0.0;
    int i;
    avg = meanlist(array, size);
    for (i = 0; i < size; i++) {
        variance += pow(array[i] - avg, 2);
    }
    variance = variance / (size - 1);
    return sqrt(variance);
}
- Calculate the Gaussian probability density function

// Gaussian PDF: f(x) = exp(-(x - mean)^2 / (2 * stdev^2)) / (sqrt(2 * pi) * stdev)
double calculate_probability(type x, double mean, double stdev)
{
    double exponent;
    double pi = 3.14159; // M_PI from <math.h> would be more precise
    exponent = exp(-(pow(x - mean, 2) / (2 * pow(stdev, 2))));
    return (1 / (sqrt(2 * pi) * stdev)) * exponent;
}
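As a quick check, at x = mean the density reduces to 1 / (sqrt(2π) × stdev), so calculate_probability(1.0, 1.0, 1.0) should return roughly 0.3989. The functions below also call summarize_dataset, which this document never lists; a sketch of what it presumably does, given how summary[i][0] and summary[i][1] are used below (per-column mean and standard deviation of one class subset, returned as a d × 2 matrix):

// Presumed summarize_dataset: summary[i][0] = mean, summary[i][1] = stdev of column i over n rows
type **summarize_dataset(type **array, int d, int n)
{
    int i, j;
    type **summary = createarray(d, 2);
    type *column = (type *)malloc(n * sizeof(type));
    for (i = 0; i < d; i++) {
        for (j = 0; j < n; j++) column[j] = array[j][i]; // gather column i
        summary[i][0] = meanlist(column, n);
        summary[i][1] = stdev(column, n);
    }
    free(column);
    return summary;
}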
- Extract the rows of each class

// Regroup the samples by label (assumes three classes of equal size, n/3 rows each)
type **separate_by_class(type **array, int n, int d)
{
    type **separated = createarray(n, d);
    type **separated1 = createarray(n / 3, d);
    type **separated2 = createarray(n / 3, d);
    type **separated3 = createarray(n / 3, d);
    int i;
    int t = 0, tt = 0, ttt = 0;
    for (i = 0; i < n; i++) {
        if (array[i][d - 1] == 0) separated1[t++] = array[i];
        if (array[i][d - 1] == 1) separated2[tt++] = array[i];
        if (array[i][d - 1] == 2) separated3[ttt++] = array[i];
    }
    // Interleave the classes: rows 3i, 3i+1 and 3i+2 hold one sample of each label
    for (i = 0; i < n / 3; i++) {
        separated[nclass * i] = separated1[i];
        separated[nclass * i + 1] = separated2[i];
        separated[nclass * i + 2] = separated3[i];
    }
    return separated;
}
- Calculate the probability that a sample belongs to each class

// Uses the calculate_probability and summarize_dataset functions
type *calculate_class_probabilities(type **array, type *row, int d, int n)
{
    int i;
    type *probablity;
    probablity = (type *)malloc(nclass * sizeof(type));
    type **separated1 = createarray(n / 3, d);
    type **separated2 = createarray(n / 3, d);
    type **separated3 = createarray(n / 3, d);
    for (i = 0; i < nclass; i++) {
        probablity[i] = 0.33; // class prior: one third per class
    }
    // Undo the interleaving produced by separate_by_class
    for (i = 0; i < n / 3; i++) {
        separated1[i] = array[nclass * i];
        separated2[i] = array[nclass * i + 1];
        separated3[i] = array[nclass * i + 2];
    }
    type **summary = NULL;
    // The probability of label one (40 rows per class is hardcoded here)
    summary = summarize_dataset(separated1, d, 40);
    for (i = 0; i < d - 1; i++) {
        probablity[0] = probablity[0] * calculate_probability(row[i], summary[i][0], summary[i][1]);
    }
    // The probability of label two
    summary = summarize_dataset(separated2, d, 40);
    for (i = 0; i < d - 1; i++) {
        probablity[1] = probablity[1] * calculate_probability(row[i], summary[i][0], summary[i][1]);
    }
    // The probability of label three
    summary = summarize_dataset(separated3, d, 40);
    for (i = 0; i < d - 1; i++) {
        probablity[2] = probablity[2] * calculate_probability(row[i], summary[i][0], summary[i][1]);
    }
    return probablity;
}
- Compare the probabilities and predict the class

// Return the index of the class with the highest probability
double predict(type *array)
{
    int i;
    double result = 0;
    double max = array[0];
    for (i = 0; i < nclass; i++) {
        if (array[i] > max) {
            max = array[i]; // track the maximum probability
            result = i;
        }
    }
    return result;
}
3、Part of the main function
- Algorithm score

int evaluate_algorithm(type **train, type **test, int size2, int size1, int d)
{
    int i;
    int scores = 0;
    type *ppre;
    ppre = naive_bayes(train, test, size2, size1, d); // predictions for the test set
    for (i = 0; i < size1; i++) {
        if (ppre[i] == test[i][d - 1]) {
            scores += 1; // count correct predictions
        }
    }
    return scores; // the caller converts this to an accuracy with a float cast
}
- Cross validation (the first fold as an example)

// The first fold: the test rows are those whose index is divisible by 5
for (j = 0; j < size1; j++) {
    test[j] = Sarray[nfold * j];
}
for (i = 0; i < size1; i++) {
    for (k = 0; k < 4; k++) {
        train[i * (nfold - 1) + k] = Sarray[nfold * i + k + 1];
    }
}
scores = evaluate_algorithm(train, test, size2, size1, D);
accuracy1 = (float)scores / (float)size1;
printf("accuracy1: %f \n", accuracy1);
4、The results of the algorithm
- Evaluation

Even without Python's dict(), all functions and steps of the algorithm are implemented, cross validation works, and the algorithm reaches a high accuracy. Little room is left for further optimization of the algorithm.