Topic Modelling of PDF data.

See the topic_modelling.ipynb notebook for details.

The input is 32 PDF documents, all research papers on LENR (low-energy nuclear reactions). We preprocess the text and then model topics using clustering, TF-IDF, and the ChatGPT API.

Data Cleaning.
- Convert PDFs to text, and extract the title and abstract from each document.
- Programmatically preprocess the text: remove stop words, remove special or unusual characters, lowercase, and lemmatize.
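
The text-cleaning step above might look roughly like the sketch below. It uses only the standard library, so lemmatization is left out (it would normally come from a library such as NLTK or spaCy). The `preprocess` function, the tiny `STOP_WORDS` set, and the sample sentence are illustrative assumptions, not code taken from the notebook.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real run would use a
# fuller list such as the ones shipped with NLTK or spaCy.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "we"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, strip non-letter characters, and drop stop words."""
    text = text.lower()
    # Replace anything that is not a lowercase ASCII letter or whitespace
    # with a space, so digits and punctuation do not become tokens.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in stop_words]


# Made-up sample sentence, used only to show the effect of each step.
tokens = preprocess("We report Excess Heat in the Pd/D system (2019)!")
print(tokens)
```

A lemmatizer, if available, would be applied to each token in the returned list before any embedding model is trained.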
Embedding Vectors.
- Word Embeddings (Word2Vec).
- Document Embeddings (Doc2Vec).
Cosine Similarity using Embedding Vectors.
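
As a minimal sketch of the cosine-similarity step, assuming each document has already been mapped to a fixed-length vector (for example by Doc2Vec), the pairwise similarities can be computed as follows. The function names and the toy vectors are hypothetical and are not taken from the notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities between all document vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy document vectors (made up): the first two point in the same direction,
# the third is orthogonal to both.
doc_vectors = [[1.0, 0.0, 1.0], [2.0, 0.0, 2.0], [0.0, 3.0, 0.0]]
sims = similarity_matrix(doc_vectors)
for row in sims:
    print([round(x, 3) for x in row])
```

Because cosine similarity ignores vector length, the first two toy documents score as near-identical even though one vector is twice as long as the other.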
Dimensionality reduction through PCA.
Clustering through K-means.
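
To illustrate the clustering step, here is a small pure-Python sketch of K-means (Lloyd's algorithm) on toy 2-D points, standing in for the PCA-reduced document vectors. In practice a library implementation such as scikit-learn's `KMeans` would be the usual choice. The `kmeans` function, its parameters, and the sample points are assumptions made for illustration only.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct points drawn from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
    return labels, centroids


# Toy 2-D points (made up) forming two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The cluster numbering depends on the random initialisation, so only the grouping of points is meaningful, not the label values themselves.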
Class-based TF-IDF.
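
The sketch below shows one way class-based TF-IDF can be computed, assuming the formulation popularized by BERTopic: all documents in a cluster are merged into one "class document", and each term is scored as its normalized in-class frequency times log(1 + A / f_t), where A is the average number of words per class and f_t is the term's frequency across all classes. The notebook may implement this differently; the function names and toy tokens are hypothetical.

```python
import math
from collections import Counter


def class_tfidf(docs_tokens, labels):
    """Return {label: {term: score}} using a class-based TF-IDF weighting."""
    # Merge all documents of a cluster into a single bag of words.
    class_counts = {}
    for tokens, label in zip(docs_tokens, labels):
        class_counts.setdefault(label, Counter()).update(tokens)

    # f_t: frequency of each term across all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per class.
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[label] = {
            term: (freq / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, freq in counts.items()
        }
    return scores


def top_terms(scores, n=3):
    """Highest-scoring terms per class, with ties broken alphabetically."""
    return {
        label: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for label, s in scores.items()
    }


# Toy preprocessed documents and cluster labels (made up for illustration).
docs_tokens = [
    ["heat", "palladium", "heat", "experiment"],
    ["palladium", "deuterium", "heat", "experiment"],
    ["neutron", "detector", "neutron", "experiment"],
    ["neutron", "gamma", "detector", "experiment"],
]
labels = [0, 0, 1, 1]
scores = class_tfidf(docs_tokens, labels)
print(top_terms(scores, n=2))
```

In this toy data "experiment" occurs as often as "palladium" within cluster 0, but it scores lower because it also appears in the other cluster, which is the behaviour that makes the top terms usable as topic labels.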
Improve topic representations with an LLM (GPT-3.5 Turbo).

Colab Notebook link: here

mgokulkrish/Topic_Modelling
