Document_Clustering

Given a dataset of research papers, this project automatically clusters them by field or domain of research.

Primary Language: Python

This is a document clustering project.
At a high level, the input to this project is a set of research papers and the number of clusters.
The output is the set of clusters, each listing the names of the research papers assigned to it.

Explanation of the Scripts:
    1) pdf2text.py converts a research paper from PDF format to txt format.
    2) convertAllPdf2Text.sh takes as input a directory containing all the research papers in PDF format,
       converts them to txt format, and stores the resulting txt files in a folder called 'TextFiles'
       in the same directory.
    3) tidf.py takes as input the name of the directory containing the text files and the number of clusters.
       The output, the names of the research papers in each cluster, is printed to standard output.
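The clustering step can be illustrated with a small, self-contained sketch. This is not the code in tidf.py: it assumes, based only on the script's name, that documents are weighted with TF-IDF, and it swaps in a simple cosine-similarity k-means with a deterministic farthest-first start. All function names here (tokenize, tfidf_vectors, kmeans, cluster_names) are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalize(vec):
    """Scale a sparse vector (term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {term: w / norm for term, w in vec.items()}


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (a dict) per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        # Smoothed IDF: terms found in every document keep a small weight.
        vec = {
            term: (n / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, n in c.items()
        }
        vectors.append(normalize(vec))
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def kmeans(vectors, k, max_iter=50):
    """Cluster unit vectors by cosine similarity; return one label per vector."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of documents")
    # Assumption: deterministic farthest-first start instead of random seeds.
    centroids = [vectors[0]]
    while len(centroids) < k:
        farthest = min(
            range(len(vectors)),
            key=lambda i: max(cosine(vectors[i], c) for c in centroids),
        )
        centroids.append(vectors[farthest])
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                summed = defaultdict(float)
                for v in members:
                    for term, w in v.items():
                        summed[term] += w
                centroids[j] = normalize(summed)
    return labels


def cluster_names(names, docs, k):
    """Group document names by the cluster assigned to their text."""
    labels = kmeans(tfidf_vectors(docs), k)
    clusters = defaultdict(list)
    for name, label in zip(names, labels):
        clusters[label].append(name)
    return dict(clusters)
```

Calling cluster_names with a list of file names, the matching list of file contents, and the number of clusters returns a dictionary mapping each cluster index to the paper names in that cluster, which mirrors the kind of output described above.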

Order in Which to Run the Code:
    1) Run convertAllPdf2Text.sh
    2) Run tidf.py with the respective arguments
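The two steps above could be chained from a small Python driver, sketched below. Several details are assumptions, not facts from this project: that both scripts sit in the current directory, that they are launched with bash and python, and that they take their inputs as positional arguments in the order shown (directory first, then the number of clusters). The helper names build_commands and run_pipeline are hypothetical. The only detail taken from the description above is that the text files end up in a 'TextFiles' folder inside the PDF directory.

```python
import os
import subprocess


def build_commands(pdf_dir, n_clusters):
    """Build the two command lines, in the order they should run."""
    # The converter stores its output in 'TextFiles' inside the PDF directory.
    text_dir = os.path.join(pdf_dir, "TextFiles")
    return [
        # Assumption: the PDF directory is the script's only positional argument.
        ["bash", "convertAllPdf2Text.sh", pdf_dir],
        # Assumption: arguments are <text directory> <number of clusters>.
        ["python", "tidf.py", text_dir, str(n_clusters)],
    ]


def run_pipeline(pdf_dir, n_clusters):
    """Run conversion, then clustering; stop at the first failing step."""
    for command in build_commands(pdf_dir, n_clusters):
        subprocess.run(command, check=True)
```

For example, run_pipeline("papers", 5) would first convert the PDFs in a hypothetical papers directory and then cluster the resulting text files into five groups, provided the argument order assumed here matches the real scripts.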

Results for different numbers of clusters can be found in the Results directory.