Project_Iwannidhs_Fall-2020

This is a team project for the Information Systems course, Fall 2020 - 2021. Its purpose is to create 3 separate 'small' projects that are 'connected' to each other and together result in the final project: a complete Information System.

Primary Language: C · MIT License

Information Systems Project

Fall 2020 - 2021

Open Source


📁 Project File Structure


Compilation:

$ make

Execution:

$ ./project2 ../dataset/sigmod_large_labelled_dataset.csv

  or

$ ./project2 ../dataset/sigmod_medium_labelled_dataset.csv  

(Execution time on the large dataset ≈ 3 sec)

To delete binary files:

$ make clean

Project1 Analysis:

First of all, this project uses the json-c library (Ubuntu package libjson-c-dev), so you need to install it on your system by running (on Ubuntu systems):

$ sudo apt install libjson-c-dev

For non-Ubuntu systems, adjust accordingly.


The whole project was implemented on a Linux operating system (Ubuntu 20.04), and the compiler used is gcc 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04).

The data structure used in this 1st project is a binary search tree whose nodes consist of linked lists of buckets.
The program first opens the dataset directory where the camera specs are located; it then goes into each and every subdirectory, where it searches for and opens each and every one of the json files.
Afterwards we insert each of the json files into the node's 1st bucket, and when the 1st bucket is full of json files we allocate the proper amount of memory from the heap to create a new bucket and insert the remaining json files there.

The operation described above is dynamic (not hardcoded): the program keeps adding json files, growing the linked list of buckets on each and every node of the tree, based on the total number of json files provided to the application.
After we're done and all the files are inserted, we deallocate the memory that we previously allocated from the heap!
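
For illustration, a node of this structure might look roughly like the following sketch (the field names and the bucket capacity here are assumptions; the real definitions live in structs.c):

#define BUCKET_SIZE 16                 /* illustrative capacity, not the project's actual value */

typedef struct bucket {
    char *json_paths[BUCKET_SIZE];     /* json files stored in this bucket */
    int   count;                       /* how many slots are currently used */
    struct bucket *next;               /* next bucket, allocated when this one fills up */
} bucket;

typedef struct tree_entry {
    char   *key;                       /* the string the tree is ordered by */
    bucket *first_bucket;              /* head of this node's linked list of buckets */
    struct tree_entry *left, *right;   /* binary search tree children */
} tree_entry;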

Our program checks (as requested by the project description) the files against each other and 'prints' into a .csv file the relation of each and every file with each other (i.e. all the json files that are the same as each other, based on the specs contained inside).


Details

Following the path:
src ~> dot-C_files ~> W_handler.c
We have:

void insert_entry(tree_entry *entry, bucket* first_bucket);

which is responsible for adding the pointers (relations) inside the buckets of the tree.

void add_relation(tree_entry* root, char* json1, char* json2, int relation);

which is responsible for comparing the specs of the .json files and adding the relation of the ones that are the same to our linked list of buckets.

void print_all_relations(tree_entry *,FILE*);

which is responsible for printing into the .csv all the specs that describe the same commodity.
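
Put together, the intended call order looks roughly like this hypothetical driver (the file names and the surrounding setup are assumptions for illustration; it relies on the project's tree_entry and W_handler declarations):

#include <stdio.h>

/* Hypothetical driver: `root` is assumed to have been built by the insertion code. */
void dump_relations(tree_entry *root)
{
    FILE *out = fopen("relations.csv", "w");

    /* For each labelled pair read from the W file: 1 = same commodity. */
    add_relation(root, "spec_a.json", "spec_b.json", 1);

    /* Finally, write every recorded relation into the .csv file. */
    print_all_relations(root, out);
    fclose(out);
}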

Inside:
src ~> dot-C_files ~> structs.c
Here we have implemented the data structures described above, plus whatever helper functions we needed. They consist of the following:

  • Helper function that compares the strings being inserted into the tree
    • returns 0 if a is smaller, 1 if it is bigger, 3 if they are the same (it should never return 3)
int compare(char* a, char* b);
  • The data structures and their operations:
tree_entry *insert(tree_entry *T, int x, char *path_with_JSON, char* json_specs);
tree_entry *search(tree_entry *T, char* x);
tree_entry *Delete(tree_entry *T,int x);
int height(tree_entry *T);
tree_entry *rotateright(tree_entry *x);
tree_entry *rotateleft(tree_entry *x);
tree_entry *RR(tree_entry *T);
tree_entry *LL(tree_entry *T);
tree_entry *LR(tree_entry *T);
tree_entry *RL(tree_entry *T);
int BF(tree_entry *T);
void preorder(tree_entry *T);
void inorder(tree_entry *T);
void free_node(tree_entry *T);
  • Helper Function to read and store the contents of the json files.
    • It takes a .json file as input, parses each file's json object and converts it into a C string; it then allocates the proper amount of memory (the size of each json object) for that string and stores it in our tree buckets.
char* read_json(char* json_filename);
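
As a sketch of what read_json could do with json-c (the error handling and exact calls are assumptions; the project's actual implementation may differ):

#include <json-c/json.h>
#include <string.h>
#include <stdlib.h>

char *read_json(char *json_filename)
{
    struct json_object *obj = json_object_from_file(json_filename);
    if (obj == NULL)
        return NULL;                         /* missing file or invalid json */

    /* Serialise the parsed object back into a C string... */
    const char *tmp = json_object_to_json_string(obj);

    /* ...and copy it into heap memory sized to the object, as described above. */
    char *json_specs = strdup(tmp);

    json_object_put(obj);                    /* release the json-c object */
    return json_specs;
}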

Unit Testing Part_1 & Part_2

Following the path:
src ~> test ~> simple_test.c
We have:

void test_compare(void);
  • responsible for testing the function compare.
void test_height(void);
  • responsible for testing the function height.
void test_BF(void);
  • responsible for testing the function BF.
void test_read_json(void);
  • responsible for testing the updated and improved function read_json.
void test_insert(void);
  • responsible for testing the function insert.
void test_insert_entry(void);
  • responsible for testing the function insert_entry.
void test_search(void);
  • responsible for testing the function search.
void test_clique_tree_search(void);
  • responsible for testing the function clique_tree_search.
void test_clique_tree_insert(void);
  • responsible for testing the function clique_tree_insert.
void test_add_positive_relation(void);
  • responsible for testing the function add_positive_relation.
void test_add_negative_relation(void);
  • responsible for testing the function add_negative_relation.
void test_create_bow_tree(void);
  • responsible for testing the function create_bow_tree.
void test_create_tf_idf(void);
  • responsible for testing the function create_tf_idf.
void test_create_train_set_bow(void);
  • responsible for testing the function create_train_set_bow.
void test_create_train_set_tfidf(void);
  • responsible for testing the function create_train_set_tfidf.
void test_print_node_relations(void);
  • responsible for testing the function print_all_positive/negative_relations.

  • Helper Function to compare 2 .csv files.

    • It takes 2 .csv files as input and compares them char by char. Returns 0 if the files are the same and -1 if the files are different.
int helper_compareFile(FILE * fPtr1, FILE * fPtr2, int * line, int * col);
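
In sketch form, the comparison described above could look like this (the position-tracking details are assumptions; the real helper lives in simple_test.c):

#include <stdio.h>

int helper_compareFile(FILE *fPtr1, FILE *fPtr2, int *line, int *col)
{
    int c1, c2;
    *line = 1;
    *col  = 0;

    do {
        c1 = fgetc(fPtr1);
        c2 = fgetc(fPtr2);
        (*col)++;

        if (c1 == '\n') {                    /* track the position for error reporting */
            (*line)++;
            *col = 0;
        }
        if (c1 != c2)
            return -1;                       /* files differ at (*line, *col) */
    } while (c1 != EOF && c2 != EOF);

    return 0;                                /* reached EOF together: identical files */
}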

Run the Unit Testing:
How to compile the Unit tests

  • gcc simple_test.c ../dot-C_files/structs.c ../dot-C_files/W_handler.c ../dot-C_files/json_parser.c ../dot-C_files/train_set_handler.c -o simple_test -ljson-c -lm
How to run the tests

  • We can run all the tests in the suite
    • ./simple_test
  • We can run only the tests specified
    • ./simple_test insert
    • ./simple_test add_relation

  • The code is thoroughly commented wherever needed, for better comprehension.
  • Thank you! 😄

Part 2 of the Project

From here on we continue with the documentation of part 2 of the project.

Negative Relations

For the negative relations, a tree is added to the headbucket of every clique. Every node of that tree has a pointer to another clique (to its headbucket, specifically). These pointers effectively connect different cliques, and they are created while the W file is being read.
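
In sketch form the idea is roughly the following (all names here are illustrative assumptions, not the project's actual declarations):

/* Each clique's headbucket carries a small tree of "negative" links. */
typedef struct neg_node {
    struct bucket   *other_clique;     /* points to the headbucket of a different clique */
    struct neg_node *left, *right;     /* this clique's tree of negative relations */
} neg_node;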

Parsing

First and foremost, the parsing is now written completely from scratch, without any use of external libraries, unlike the 1st part of the project. For json parsing we first read every json and put all its contents in a string, just like in project1. We then parse it line by line and keep only the words, removing the labels and symbols surrounding the lines.

Creating BoW-IDF

Dictionary

The dictionary is stored in an AVL tree. Every node of the tree holds the string of the word, an array of wordcounts (which represents the column of the BoW representation for that word) and an integer key for that word (so that we can use arrays later). The dictionary tree is created by reading every word of every json, preprocessing it, and updating/creating tree nodes based on it.
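
A dictionary node might therefore look roughly like this (field names assumed, not copied from the project's headers):

typedef struct dict_node {
    char *word;                        /* the preprocessed word itself */
    int  *wordcount;                   /* one count per json: this word's BoW column */
    int   key;                         /* integer id of the word, for array indexing later */
    int   height;                      /* AVL balancing bookkeeping */
    struct dict_node *left, *right;
} dict_node;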

Preprocessing

The preprocessing is composed of a series of functions that alter the word (see the sketch after this list), by:

  • lowercasing it
  • removing symbols from it; for instance "camera)," becomes "camera" (not every symbol though, because, for example, we don't want "3.5" to become "35", as that changes the word's meaning)
  • ignoring really big words (controlled by MAXWORDSIZE = 10 in the train_handler.h file, changeable)
  • the stopwords are removed later, after the BoW tree is created
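
A minimal sketch of such a pass, assuming the function name and the exact set of kept characters (the real code may keep a different set of symbols):

#include <ctype.h>
#include <string.h>

#define MAXWORDSIZE 10                 /* same default as in train_handler.h */

/* Returns 1 if the cleaned word should be kept, 0 if it should be ignored. */
int preprocess(char *word)
{
    char *dst = word;

    for (char *src = word; *src; src++) {
        /* lowercase; keep letters, digits and '.' so that "3.5" survives */
        if (isalnum((unsigned char)*src) || *src == '.')
            *dst++ = (char)tolower((unsigned char)*src);
        /* every other symbol (parentheses, commas, ...) is dropped */
    }
    *dst = '\0';

    return (*word != '\0' && strlen(word) <= MAXWORDSIZE);
}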

Creating BoW and TF_IDF arrays

The BoW array is created by recursively iterating over the BoW dictionary and combining the columns of the non-stopwords. The TF_IDF array is created by first iterating over the BoW array and gathering the necessary extra data (the number of words in a json for tf, and the number of jsons that contain a specific word at least once for idf). The rest of the data needed for tf and idf (the BoW entries and the number of jsons) we already have. The TF_IDF array is then allocated and filled based on the data above.
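
Spelled out, assuming the standard tf-idf definitions (which match the quantities the text says it collects):

#include <math.h>

double tf_idf(int count_in_json, int words_in_json,
              int total_jsons, int jsons_with_word)
{
    double tf  = (double)count_in_json / (double)words_in_json;      /* term frequency */
    double idf = log((double)total_jsons / (double)jsons_with_word); /* inverse document frequency */
    return tf * idf;
}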

Improving Bow and TF_IDF

Because the distinct words are far too many even with preprocessing (about 30000), we need to select the most "important" ones, so that the train and test sets aren't huge and we avoid unnecessary noise.

This is done by saving the idf value of every word column to an array (during the TF_IDF creation), and then only using the ones with the lowest value (most importance). This sorting function returns the indexes of the best word columns, which are used to create the improved arrays.

Creating Machine Learning data

The dataset is created based on positive_relations.csv, negative_relations.csv and the improved BoW or TF_IDF (depending on what the user asks for). Every relation is translated into the subtraction of the two vectors of its jsons, producing one vector for ML. These vectors are split into Train_Set.csv and Test_Set.csv (4/5 to train and 1/5 to test) to be used for Machine Learning.
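
The per-relation translation is then just an element-wise difference, roughly (names assumed):

/* One labelled relation -> one ML feature vector: the element-wise difference
 * of the two jsons' rows in the improved BoW/TF_IDF array. The resulting
 * vectors are then split 4/5 into Train_Set.csv and 1/5 into Test_Set.csv. */
void relation_to_vector(const double *row1, const double *row2,
                        double *out, int dim)
{
    for (int i = 0; i < dim; i++)
        out[i] = row1[i] - row2[i];
}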

Compilation:

$ make

Execution:

$ ./project2 bow

  or

$ ./project2 tf_idf  

Machine Learning

Logistic Regression

First and foremost, the machine learning (logistic regression) part of the project is implemented in standard C++ without using any external libraries!

  • To compile change directory into the MachineLearning directory and provide the following:
$ g++ -o log_regr main_log_regr.cpp
  • To run:
$ time ./log_regr

The logistic regression part of the project is implemented following the project's description, without any further modifications. (The source code is well commented for further understanding.)
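
The project's own code is C++; for consistency with the sketches above, here is a compact C sketch of the core logistic-regression update it describes (the learning rate, batching and any regularisation are assumptions, not the project's choices):

#include <math.h>

static double sigmoid(double z)
{
    return 1.0 / (1.0 + exp(-z));
}

/* One stochastic-gradient step on a single example (x, y), y in {0, 1}:
 * p = sigmoid(w.x + b), then w[i] += lr * (y - p) * x[i]. */
void sgd_step(double *w, double *b, const double *x, int y,
              int dim, double lr)
{
    double z = *b;
    for (int i = 0; i < dim; i++)
        z += w[i] * x[i];

    double err = (double)y - sigmoid(z);   /* negative gradient of the log-loss */

    for (int i = 0; i < dim; i++)
        w[i] += lr * err * x[i];
    *b += lr * err;
}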

Unit Testing (for logistic regression)

The unit testing for the machine learning part of the project is done using Googletest, the Google Testing and Mocking Framework.
Installation instructions for the framework can be found here ~> gtest.

  • To run the tests:

Get into the src/MachineLearning/tests directory and provide the following:

  • Compile:
$ g++ ML_test.cpp -lgtest -lgtest_main -pthread 
  • Run:
$ ./a.out

🧑 Nikitas Sakkas, 🧔 Andrew Pappas, 👩 Konstantina Nika
©️ 2020 - 2021, All Rights Reserved.