A Baseline Recommendation Model Using Transformers' Zero-Shot Classifier
I built this as part of the technical roadmap for a Minimum Viable Product, which we presented as a startup to an accelerator program. The startup was accepted by the accelerator at the Masason Foundation. Here is an excerpt from the technical roadmap that I wrote.
Our initial application will be a recommender system that recommends papers to users based on their qualifications and interests. We will start with content-based filtering, which will later be complemented by collaborative filtering as user interaction starts generating more data.
Upon signing up to the site, users will be asked to provide a list of topics in which they have expertise, as well as additional information, such as their academic degree, which can be used to improve their recommendations.
For the content-based filtering, we have developed an algorithm that classifies papers into the most relevant categories through natural language processing. We will then show papers to users who have indicated interest or expertise in a given category.
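To make the matching step concrete, here is a minimal sketch of how classified papers could be paired with users by category overlap. The function name `recommend`, the paper IDs, and the user names are hypothetical and only illustrate the idea; the category codes follow arXiv's naming.

```python
from collections import defaultdict


def recommend(paper_categories, user_interests):
    """Map each user to the papers whose predicted categories overlap their interests."""
    # Invert paper -> categories into category -> papers for fast lookup.
    papers_by_category = defaultdict(set)
    for paper_id, categories in paper_categories.items():
        for category in categories:
            papers_by_category[category].add(paper_id)

    recommendations = {}
    for user, interests in user_interests.items():
        matched = set()
        for category in interests:
            # Unknown categories simply contribute no papers.
            matched |= papers_by_category.get(category, set())
        recommendations[user] = sorted(matched)
    return recommendations


# Hypothetical sample data: predicted categories per paper, interests per user.
paper_categories = {
    "paper-1": ["cs.LG", "stat.ML"],
    "paper-2": ["math.CO"],
    "paper-3": ["cs.CL", "cs.LG"],
}
user_interests = {"alice": ["cs.LG"], "bob": ["math.CO", "cs.CL"]}

print(recommend(paper_categories, user_interests))
# {'alice': ['paper-1', 'paper-3'], 'bob': ['paper-2', 'paper-3']}
```

This version treats every match equally. Ranking the matched papers by classifier confidence would be a natural next step.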
The available category list is obtained from the arXiv dataset. However, our current method of classification allows us to add new labels (categories) at any time without additional model training.
We have run experiments on the model. Here are the details:
Dataset: arXiv dataset
Preprocessing: While the pipeline includes tokenization, additional preprocessing measures are essential to ensure the best possible results from the model. Our experiments have shown that the pipeline performs consistently better with the following preprocessing steps.
Preprocessing steps:
- Lemmatization of the words
- Removing stop-words (such as "of, the, is, was, ...")
- Making all characters lowercase
- Removing special characters
- Tokenization is handled by the model pipeline itself.
- An upcoming preprocessing feature will be LaTeX recognition (so that LaTeX formulas are not stripped from the text but are stored separately), which will be used to estimate the mathematical complexity of the paper.
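The steps above can be sketched roughly as follows. This is not our exact code. The stop-word list is a tiny illustrative subset, the order of the steps is an assumption, and the lemmatizer is passed in as a function (defaulting to a no-op) so that a real one, for example from NLTK or spaCy, can be plugged in. The `extract_latex` helper is a hypothetical take on the planned LaTeX feature and only handles inline math between single dollar signs.

```python
import re

# Tiny illustrative stop-word list; a real run would use a fuller list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "was", "we"}

# Matches inline LaTeX math delimited by single dollar signs, e.g. $x^2$.
INLINE_MATH = re.compile(r"\$[^$]+\$")


def extract_latex(text):
    """Return the text with inline formulas removed, plus the formulas themselves."""
    formulas = INLINE_MATH.findall(text)
    stripped = INLINE_MATH.sub(" ", text)
    return stripped, formulas


def preprocess(text, lemmatize=lambda word: word):
    """Clean an abstract; `lemmatize` is a pluggable word -> lemma function."""
    text, formulas = extract_latex(text)       # keep formulas aside instead of losing them
    text = text.lower()                        # make all characters lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove special characters (ASCII-only sketch)
    tokens = [lemmatize(word) for word in text.split() if word not in STOP_WORDS]
    return " ".join(tokens), formulas


cleaned, formulas = preprocess("We study the convergence of $x_n$ in Banach spaces.")
print(cleaned)   # study convergence banach spaces
print(formulas)  # ['$x_n$']
```

Note that the special-character filter here also drops accented and other non-ASCII letters, which may or may not be desirable for a given dataset.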
You can find the cleaned version of the dataset here: https://drive.google.com/file/d/1QVw0UZXKgKTWAp0ZssM2k1oDiQ2NG6K_/view?usp=sharing
Here is a simple example of an abstract, its original category and the recommended categories:
A screenshot of the abstract from ArXiv
A snippet showing the true category of the paper and 3 categories recommended by the algorithm.
Model: We are starting with the zero-shot-classification pipeline from the Hugging Face Transformers library. As a pre-trained natural language processing tool, the pipeline will be used as a baseline solution which then can be replaced by a more sophisticated and better-customized model.
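For readers who want to try the baseline, here is a minimal sketch of how the pipeline can be called. The model name `facebook/bart-large-mnli`, the helper names `build_classifier` and `top_categories`, and the use of `multi_label=True` are assumptions made for illustration, not necessarily the configuration used in our experiments. The candidate labels are a plain list, which is why new categories can be added without retraining.

```python
def build_classifier(model_name="facebook/bart-large-mnli"):
    """Create a zero-shot classifier (model choice is an assumption)."""
    # Imported lazily so top_categories can be used or tested without transformers.
    from transformers import pipeline

    return pipeline("zero-shot-classification", model=model_name)


def top_categories(classifier, abstract, labels, k=3):
    """Return the k highest-scoring (label, score) pairs for an abstract."""
    # multi_label=True scores each label independently, so a paper
    # can belong to several categories at once.
    result = classifier(abstract, candidate_labels=labels, multi_label=True)
    ranked = sorted(zip(result["labels"], result["scores"]),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Usage (downloads the model on first run):
#   classifier = build_classifier()
#   labels = ["machine learning", "combinatorics", "computation and language"]
#   print(top_categories(classifier, "We propose a new attention mechanism ...", labels))
```

Passing the classifier in as an argument keeps the ranking logic independent of the model, so the baseline can later be swapped for a more customized one without touching the rest of the code.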
Possible Extensions: Customizing and improving this model, incorporating information from LaTeX formulas, analyzing figures, and using paper references and graph algorithms.
Importance: The value of the baseline solution is that it can be implemented quickly and at little cost, and that it serves as a minimum point of comparison for the more advanced models we aim to implement. It does not require additional training even if new categories are added. In the optimization and extension steps, the recommendation system will be further refined with hybrid (collaborative and content-based) filtering, which will look not just at the abstract but also at additional metadata generated by the system (further explained below).