roarnes/medical_data_text_feature_extraction

This experiment is based on a medical data set consisting of 93 training records and 7 test records. Each record describes a set of symptoms and the corresponding diagnosis. The aim is to find the label (diagnosis) for each test record, using features extracted from the training data. Features are extracted by comparing how often each word occurs in the training and test records. In information retrieval and text mining, term frequency-inverse document frequency (tf-idf) is a well-known method for evaluating how important a word is to a document. Two approaches are used to measure similarity: Manhattan distance and cosine similarity.
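The idea described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names, the toy records, the tf-idf variant (relative term frequency times the natural log of inverse document frequency), and the nearest-neighbour labelling rule are all assumptions made for the example.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Compute idf weights from tokenized training documents (assumed variant: ln(N / df))."""
    n = len(train_docs)
    df = Counter(word for doc in train_docs for word in set(doc))
    return {word: math.log(n / df[word]) for word in df}


def vectorize(doc, idf):
    """Turn a tokenized document into a tf-idf vector over the training vocabulary.

    Words that never occur in the training data are ignored.
    """
    counts = Counter(doc)
    total = len(doc) or 1  # avoid division by zero for an empty document
    return [counts[word] / total * idf[word] for word in sorted(idf)]


def manhattan(a, b):
    """Manhattan (L1) distance: smaller means more similar."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(test_doc, train_docs, labels, metric="cosine"):
    """Label a test document with the label of its closest training document (assumed rule)."""
    idf = fit_idf(train_docs)
    train_vecs = [vectorize(doc, idf) for doc in train_docs]
    query = vectorize(test_doc, idf)
    indices = range(len(train_vecs))
    if metric == "cosine":
        best = max(indices, key=lambda i: cosine(query, train_vecs[i]))
    else:
        best = min(indices, key=lambda i: manhattan(query, train_vecs[i]))
    return labels[best]


# Hypothetical toy records, not taken from the actual data set.
train_docs = [
    ["fever", "cough", "sore", "throat"],
    ["chest", "pain", "shortness", "breath"],
    ["headache", "nausea", "light", "sensitivity"],
]
labels = ["flu", "cardiac", "migraine"]
test_doc = ["cough", "fever", "fatigue"]

print(predict(test_doc, train_docs, labels, metric="cosine"))
print(predict(test_doc, train_docs, labels, metric="manhattan"))
```

Note that the two measures point in opposite directions: cosine similarity is maximised, while Manhattan distance is minimised, so the sketch picks the best training record accordingly.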

Python

This repository is not active