This is a document clustering project. In an abstract sense, the input to this project is a set of research papers and the number of clusters. The output is the set of clusters, each containing the names of the research papers assigned to it.

Explanation of the Different Scripts Used:
1) pdf2text.py converts a research paper from PDF format to txt format.
2) convertAllPdf2Text.sh takes as input a directory containing all the research papers in PDF format, converts them to txt format, and stores the resulting txt files in a folder called 'TextFiles' in the same directory.
3) tidf.py takes as input the name of the directory containing the text files and the number of clusters. The output, which is the names of the research papers in each cluster, is printed.

Sequence of Running the Code:
1) Run convertAllPdf2Text.sh
2) Run tidf.py with the respective arguments

The results for different numbers of clusters can be found in the Results directory.
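
The clustering step can be illustrated with a small, self-contained sketch. This is not the code in tidf.py: the README does not say which algorithm that script uses, so the TF-IDF weighting (suggested only by the file name) and the k-means style clustering below are assumptions, and every function name and sample document here is hypothetical. The sketch uses only the Python standard library and clusters short in-memory strings instead of reading files from 'TextFiles'.

```python
import math
import re
from collections import Counter

# Illustrative sketch only: TF-IDF + spherical k-means are assumptions,
# not necessarily what tidf.py implements.


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (a term -> weight dict) per document."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Term count times a smoothed inverse document frequency.
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Dot product of two sparse vectors (cosine similarity for unit vectors)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, iterations=20):
    """Assign each unit vector to one of k clusters; returns a list of labels."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of documents")
    # Deterministic farthest-first seeding: start from document 0, then keep
    # adding the document least similar to the seeds chosen so far.
    seeds = [0]
    while len(seeds) < k:
        candidates = [i for i in range(len(vectors)) if i not in seeds]
        seeds.append(min(
            candidates,
            key=lambda i: max(cosine(vectors[i], vectors[s]) for s in seeds),
        ))
    centroids = [dict(vectors[s]) for s in seeds]
    labels = [-1] * len(vectors)
    for _ in range(iterations):
        # Assignment step: pick the most similar centroid for every document.
        new_labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments are stable, so the clustering has converged
        labels = new_labels
        # Update step: each centroid becomes the normalised sum of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if not members:
                continue  # keep the previous centroid for an empty cluster
            total = Counter()
            for v in members:
                for t, w in v.items():
                    total[t] += w
            norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
            centroids[c] = {t: w / norm for t, w in total.items()}
    return labels


# Hypothetical stand-ins for the converted text files.
papers = {
    "cnn_vision.txt": "convolutional neural network image classification deep learning",
    "rnn_speech.txt": "recurrent neural network speech recognition deep learning",
    "btree_index.txt": "database index btree query optimization storage",
    "sql_joins.txt": "database query join optimization relational storage",
}
names = list(papers)
labels = kmeans(tfidf_vectors(list(papers.values())), k=2)

clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)
for label, members in sorted(clusters.items()):
    print(label, members)
```

With these four sample strings and k=2, the two neural network papers end up in one cluster and the two database papers in the other, mirroring the project's output of paper names grouped by cluster. For a real collection, a library implementation of TF-IDF and k-means would likely be a better fit than this hand-written version.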
Bhargavasomu/Document_Clustering
Given a dataset of research papers, this project automatically clusters them based on their field or domain of research.
Python