This repository contains the solution for the Harvard Business School take-home assignment about retrieval.
Folder/File | Description |
---|---|
data/ | Should contain the `label.csv`, `product.csv` and `query.csv` files to work. |
notebooks/ | Contains all the Jupyter Notebooks used for solving the assignment. |
notebooks/HBS_retrieval_assignment.ipynb | Base TF-IDF solution. |
notebooks/01_openai_solution.ipynb | OpenAI embeddings solution. |
notebooks/02_open_source_embeddings.ipynb | Open source embeddings solution. ❗ Uses Metal Performance Shaders (MPS) |
notebooks/03_fine_tuneembeddings.ipynb | Fine-tuning open source embeddings solution. ❗ Uses GPU |
src/ | Contains all Python modules used for solving the assignment. |
.env.copy | Template for environment variables. |
Makefile | Contains shortcuts for various commands. |
- Several approaches could improve the search engine's MAP@10 score. Pre-trained embedding models, both closed-source (such as OpenAI's, see `notebooks/01_openai_solution.ipynb`) and open-source (`notebooks/02_open_source_embeddings.ipynb`), could improve the semantic understanding of queries and documents. Fine-tuning the open-source models (`notebooks/03_fine_tuneembeddings.ipynb`) on domain-specific data might improve performance further. Implementing BM25, a well-established ranking function in information retrieval, could also improve relevance scoring. A hybrid approach that combines several of these techniques might yield the best results by drawing on the strengths of each.
- To evaluate partial matches, a scoring function can assign half a point to a partial match and a full point to an exact match. This acknowledges that partial matches still hold value, though less than exact matches, and gives a more nuanced view of the search engine's performance. Because this method may slightly inflate scores compared to stricter metrics, it is best reported alongside them. This function is implemented in `src/evaluation`.
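The half-point scheme described above could be sketched roughly as follows. This is a minimal illustration, not the code in `src/evaluation`: the function names, the label strings `"Exact"` and `"Partial"`, and the choice to normalize by the number of weighted-relevant products (capped at `k`) are all assumptions.

```python
from typing import Dict, List

# Assumed label names and weights; the real labels in label.csv may differ.
STRICT_WEIGHTS = {"Exact": 1.0}
FLEXIBLE_WEIGHTS = {"Exact": 1.0, "Partial": 0.5}


def average_precision_at_k(
    ranked_ids: List[str],
    labels: Dict[str, str],
    weights: Dict[str, float],
    k: int = 10,
) -> float:
    """Weighted average precision over the top-k ranked product ids."""
    # Keep only products whose label carries a weight under this scheme.
    gains = {pid: weights[label] for pid, label in labels.items() if label in weights}
    if not gains or not ranked_ids:
        return 0.0

    cumulative_gain = 0.0
    score = 0.0
    seen = set()
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        gain = gains.get(pid, 0.0)
        if gain > 0 and pid not in seen:  # ignore duplicate predictions
            seen.add(pid)
            cumulative_gain += gain
            score += cumulative_gain / rank
    # Assumed normalization: number of weighted-relevant products, capped at k.
    return score / min(len(gains), k)


def mean_average_precision(
    rankings: Dict[str, List[str]],
    labels_by_query: Dict[str, Dict[str, str]],
    weights: Dict[str, float],
    k: int = 10,
) -> float:
    """Mean of the per-query average precision values."""
    if not rankings:
        return 0.0
    total = sum(
        average_precision_at_k(ranked, labels_by_query.get(query, {}), weights, k)
        for query, ranked in rankings.items()
    )
    return total / len(rankings)
```

Passing `STRICT_WEIGHTS` counts only exact matches, while `FLEXIBLE_WEIGHTS` also credits partial matches at half weight. Under this particular normalization the flexible score is not guaranteed to exceed the strict one for every query, so the two are best compared on the full query set.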
Model | MAP@10 (Strict) | MAP@10 (Flexible) |
---|---|---|
TF-IDF | 0.2931 | n/a |
OpenAI Embeddings | 0.3867 | 0.5815 |
Open Source Embeddings | 0.3212 | 0.5212 |
Fine-tuned Embeddings ⭐ | 0.3939 | 0.6125 |
⭐ Fine-tuning achieves a flexible MAP@10 above 0.6, which the assignment states as production level.
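BM25, suggested above as a possible improvement, does not appear in the results table. As a starting point for trying it, here is a minimal pure-Python sketch of Okapi BM25. The class name, the whitespace tokenizer, and the parameter defaults `k1=1.5` and `b=0.75` are illustrative assumptions and are not part of this repository.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()


class BM25:
    """Okapi BM25 over a small in-memory, non-empty list of documents."""

    def __init__(self, documents: List[str], k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        tokens_per_doc = [tokenize(doc) for doc in documents]
        self.doc_lens = [len(tokens) for tokens in tokens_per_doc]
        self.avg_len = sum(self.doc_lens) / len(self.doc_lens)
        self.term_freqs = [Counter(tokens) for tokens in tokens_per_doc]

        # Document frequency: number of documents containing each term.
        doc_freq = Counter()
        for tokens in tokens_per_doc:
            doc_freq.update(set(tokens))
        n_docs = len(documents)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {
            term: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def score(self, query: str, index: int) -> float:
        """BM25 score of the document at `index` for `query`."""
        freqs = self.term_freqs[index]
        # Length normalization: longer-than-average documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[index] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            freq = freqs.get(term, 0)
            if freq:
                total += self.idf[term] * freq * (self.k1 + 1) / (freq + norm)
        return total

    def top_k(self, query: str, k: int = 10) -> List[int]:
        """Indices of up to k documents with a positive score, best first."""
        scored = [(self.score(query, i), i) for i in range(len(self.doc_lens))]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [i for score, i in scored[:k] if score > 0]
```

The indices returned by `top_k` can be mapped back to product ids and scored with the same MAP@10 functions as the other models, which keeps the comparison with the table above consistent.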
- Python 3.11
- Poetry
- Install Python 3.11: `pyenv install 3.11`
- Install Poetry: `pip install poetry`
- Install project dependencies: `poetry install`
- Copy `.env.copy` to `.env` and fill in the required environment variables.
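Before running the notebooks, it can help to confirm that `.env` has actually been filled in. The sketch below uses only the standard library; the helper names are hypothetical, and `OPENAI_API_KEY` is an assumed variable name (check `.env.copy` for the real ones).

```python
import os
from pathlib import Path
from typing import Dict, List


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_variables(required: List[str], path: str = ".env") -> List[str]:
    """Names from `required` that are unset or empty in both the file and os.environ."""
    available = {**load_env_file(path), **os.environ}
    return [name for name in required if not available.get(name)]


if __name__ == "__main__":
    # OPENAI_API_KEY is an assumed name, used here only for illustration.
    missing = missing_variables(["OPENAI_API_KEY"])
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```

Variables already exported in the shell take precedence over the file here, a design choice that mirrors how most dotenv loaders behave by default.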