Follow topic_modelling.ipynb file for details.
We are given 32 pdf documents - which are research paper related to LENR (Low energy nuclear reactions). We do preprocessing and do topic modelling through clustering, tfidf and chatgpt api's.
-
Data Cleaning.
- Convert PDFs to text, and extract the title and abstract from each document.
- Programatically preprocess the texts, which involves removing stop words, removing special or unusual characters lowercasing, lemmatization.
-
Embedding Vectors.
- Word Embeddings (Word2Vec).
- Document Embeddings (Doc2Vec).
-
Cosine Similarity using Embedding Vectors.
-
Dimensionality reduction through PCA.
-
Clustering through Kmeans.
-
Class-based TF-IDF.
-
Improve topic representations through LLM (chatgpt 3.5 turbo).
Colab Notebook link: here