NLP-on-scientific-articles-for-information-retrieval

This experiment uses language modeling to build a vector representation that identifies each research article, which facilitates the recommendation and search process.

The key work in this experiment is word embedding. I will test several word embedding methods, such as TF-IDF and word2vec, and then evaluate classification performance on the vectors produced by each embedding. The best embedding should separate the articles reasonably well and give the best classification predictions. I will use the best embedding method to realize information retrieval and produce a web app for scientific-article search.

https://github.com/zhaoxin1124ds/NLP-on-scientific-articles-for-information-retrieval/blob/main/NLP_information_retrieval.pdf

Data

The data is from Kaggle and contains the abstract and title for a set of scientific papers. Every paper is labeled with one of six topics: computer science, physics, mathematics, statistics, quantitative biology, and quantitative finance.

Experiment procedure

Data pre-processing

I removed numbers, punctuation, extra white space, new lines, etc.
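As a rough illustration, a minimal cleaning helper might look like the sketch below; the `clean_text` function and the sample sentence are illustrative, not the exact notebook code.

```python
# Minimal pre-processing sketch (illustrative, not the exact notebook code).
# Assumes the raw text is available as a Python string.
import re

def clean_text(text: str) -> str:
    """Lowercase the text and strip numbers, punctuation, new lines, and extra white space."""
    text = text.lower()
    text = re.sub(r"\n", " ", text)            # new lines -> spaces
    text = re.sub(r"[^a-z\s]", " ", text)      # drop numbers and punctuation
    text = re.sub(r"\s+", " ", text).strip()   # collapse extra white space
    return text

print(clean_text("Deep Learning\nfor NLP: 2 examples, etc."))
# -> "deep learning for nlp examples etc"
```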

Word embedding and classification

I will test the following word embedding algorithms:

  • TF-IDF
  • word2vec
  • Since the documents are labeled, I will use the classification score to evaluate the different vector space models (see the sketch after this list)
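To make this evaluation concrete, below is a minimal sketch of scoring TF-IDF vectors with a classifier in scikit-learn. The file name `papers.csv`, the column names `text` and `topic`, and the choice of logistic regression are assumptions for illustration; word2vec vectors (e.g. from gensim) could be plugged into the same train/test evaluation in place of `X`.

```python
# Sketch of the embedding + classification evaluation, assuming a CSV with a
# cleaned "text" column (title + abstract) and a "topic" label column.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("papers.csv")  # hypothetical file name for the Kaggle data

# TF-IDF vector space model
vectorizer = TfidfVectorizer(max_features=20000, stop_words="english")
X = vectorizer.fit_transform(df["text"])   # document-term TF-IDF matrix
y = df["topic"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Classification score used to compare vector space models
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```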

Information retrieval

I will use cosine similarity ranking to perform the information retrieval.
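A minimal sketch of the ranking step, assuming the fitted TF-IDF `vectorizer`, matrix `X`, DataFrame `df`, and the `clean_text` helper from the earlier sketches; the query string is made up for illustration.

```python
# Cosine-similarity ranking sketch, reusing the fitted TF-IDF vectorizer and
# document matrix X from the classification step.
from sklearn.metrics.pairwise import cosine_similarity

def search(query: str, top_k: int = 5):
    """Return (index, score) pairs for the top_k documents most similar to the query."""
    query_vec = vectorizer.transform([clean_text(query)])
    scores = cosine_similarity(query_vec, X).ravel()
    top_idx = scores.argsort()[::-1][:top_k]
    return list(zip(top_idx, scores[top_idx]))

for idx, score in search("graph neural networks for protein folding"):
    print(f"{score:.3f}  {df.loc[idx, 'title']}")  # assumes a "title" column
```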

Results

  • The classification accuracy score is 0.84
  • The information retrieval product returns reasonable recommendations for any query

More thoughts for further improvement

  • A larger set of documents to improve the balance
  • Playing with the weighting between the title and abstract
  • BERT for a more advanced word embedding attempt