Presentation slides: http://www.slideshare.net/AnkurShrivastava9/document-classification-and-clustering

SMAI Major Project

Project: Document Classification and Clustering

**Project Number:** 3

**Group Name:** team_arv

**Group Members:** Ankur Shrivastava (201405551), Ritesh Modi (201405518), Vinayak Bharti (201405522)

Abstract:

This project clusters documents from a heterogeneous dataset. Clustering is based on features such as TF-IDF, part-of-speech (PoS) tags, named entities (NER), document length, document sentiment, type of media, and source type.
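As an illustration of the TF-IDF feature mentioned above, here is a minimal sketch. It is not taken from the project's source: the class name `TfIdfSketch`, the naive tokenizer, and the weighting formula tf * ln(N / df) are all assumptions chosen for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the project's code.
class TfIdfSketch {

    // Splits text into lowercase alphanumeric tokens (a deliberately naive tokenizer).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns one term -> weight map per document, using tf * ln(N / df).
    static List<Map<String, Double>> tfIdf(List<String> docs) {
        List<List<String>> tokenized = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            List<String> tokens = tokenize(doc);
            tokenized.add(tokens);
            // Count each term once per document for the document frequency.
            for (String term : new HashSet<>(tokens)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> tokens : tokenized) {
            Map<String, Integer> counts = new HashMap<>();
            for (String term : tokens) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / tokens.size();
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                weights.put(e.getKey(), tf * idf);
            }
            vectors.add(weights);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the dog barked", "the cat purred");
        System.out.println(tfIdf(docs));
    }
}
```

With this weighting, a term that occurs in every document gets weight zero, so only terms that distinguish documents contribute to the clustering features.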

Project Scope:

Document clustering aims to help practitioners make decisions. The goal of a good document clustering scheme is to minimise intra-cluster distances between documents while maximising inter-cluster distances. The outcome of the project is a set of relevant document clusters presented in a visual, human-understandable form, which helps data analysts make decisions. The input is a corpus of documents crawled from the web, which can be HTML, text, PDF and other types.
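The intra-cluster and inter-cluster distances can be made concrete with a small sketch. It assumes each document has already been turned into a numeric feature vector and uses Euclidean distance; both the distance measure and the class name `ClusterDistanceSketch` are assumptions for illustration, not the project's actual implementation.

```java
// Hypothetical helper for illustration; not part of the project's code.
class ClusterDistanceSketch {

    // Euclidean distance between two vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Mean distance over all pairs that are in the same cluster (sameCluster = true)
    // or in different clusters (sameCluster = false). Returns 0.0 if there are no such pairs.
    static double meanPairDistance(double[][] points, int[] labels, boolean sameCluster) {
        double total = 0.0;
        int pairs = 0;
        for (int i = 0; i < points.length; i++) {
            for (int j = i + 1; j < points.length; j++) {
                if ((labels[i] == labels[j]) == sameCluster) {
                    total += euclidean(points[i], points[j]);
                    pairs++;
                }
            }
        }
        return pairs == 0 ? 0.0 : total / pairs;
    }

    static double meanIntraClusterDistance(double[][] points, int[] labels) {
        return meanPairDistance(points, labels, true);
    }

    static double meanInterClusterDistance(double[][] points, int[] labels) {
        return meanPairDistance(points, labels, false);
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 0}, {10, 1}};
        int[] labels = {0, 0, 1, 1};
        System.out.println("intra: " + meanIntraClusterDistance(points, labels));
        System.out.println("inter: " + meanInterClusterDistance(points, labels));
    }
}
```

A clustering is better, in the sense described above, when the intra-cluster value is small relative to the inter-cluster value.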

Proposed System/Approach:

Initially, the heterogeneous data set is provided. It is processed and converted into a homogeneous form. This homogeneous data set is then clustered based on the relevant features. The project is version controlled using Git and implemented in Java.

How To Run Project:

  1. Import the project into Eclipse as a Java project.
  2. Before running the project, create the folder path 'resources/raw_data' inside the project folder and copy all documents into the raw_data folder.
  3. Run com.extraction.MainFile.java as a Java Application.
  4. Choose the features to use for clustering.
  5. The final output is generated in the project folder under the name 'clustered'.
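The folder from step 2 can also be created programmatically. The following is a minimal sketch using the standard `java.nio.file` API; the class `RawDataSetupSketch` and the method `ensureRawDataDir` are hypothetical names, and the sketch assumes it is run from the project root.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of the project's code.
class RawDataSetupSketch {

    // Creates <projectRoot>/resources/raw_data if it does not exist yet and returns its path.
    static Path ensureRawDataDir(Path projectRoot) throws IOException {
        Path rawData = projectRoot.resolve("resources").resolve("raw_data");
        return Files.createDirectories(rawData);
    }

    public static void main(String[] args) throws IOException {
        // Assumes the current working directory is the project folder.
        Path dir = ensureRawDataDir(Paths.get("."));
        System.out.println("Copy the input documents into: " + dir.toAbsolutePath());
    }
}
```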