This is my work for the Fetch Rewards NLP challenge. Here are the instructions:
You will build a tool that allows users to intelligently search for offers via text input from the user.
You will be provided with a dataset of offers and some associated metadata around the retailers and brands that are sponsoring the offer. You will also be provided with a dataset of some brands that we support on our platform, and the categories that those products belong to.
Acceptance Criteria:
- If a user searches for a category (ex. diapers) the tool should return a list of offers that are relevant to that category.
- If a user searches for a brand (ex. Huggies) the tool should return a list of offers that are relevant to that brand.
- If a user searches for a retailer (ex. Target) the tool should return a list of offers that are relevant to that retailer.
- The tool should also return the score that was used to measure the similarity of the text input with each offer.
- Note that the website may be down: it is limited to 1 GB of data, and the CSV files are large enough to sometimes crash it. https://nlp-store-offers.streamlit.app/
1. Clone the repository:
   git clone https://github.com/RileyFischer/NLP-store-offers
2. Go to the directory:
   cd (directory location)
3. Install the libraries:
   pip install -r requirements.txt
4. Run the Streamlit app:
   streamlit run main.py
5. Open main.ipynb to see the code and explore.
- Streamlit-based web interface
- Data Cleaning
- Semantic similarity: Using pretrained models to find the semantic similarity of text.
- Social Network Analysis: Using the relationships between categories to build a social network and group searches into relevant categories.
- Fast search: Each search should take a second or two at most.
- Single search bar: By having one search bar the user is presented with a simple and intuitive app that they can use to search for whatever they're thinking of and get offers back, without thinking about whether they want to know about categories, retailers, or brands.
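
The category network idea in the feature list can be illustrated with a small sketch. This is not the code from main.ipynb: the (child, parent) pairs, the function names, and the two-hop limit are all assumptions made up for illustration, and the real data and column names may differ. It only shows how category relationships could be stored as a graph and used to group a search with nearby categories.

```python
from collections import defaultdict, deque

# Hypothetical (child, parent) category pairs; the real data may differ.
CATEGORY_EDGES = [
    ("Diapers", "Baby"),
    ("Baby Wipes", "Baby"),
    ("Cookies", "Snacks"),
    ("Chips", "Snacks"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (child, parent) pairs."""
    graph = defaultdict(set)
    for child, parent in edges:
        graph[child].add(parent)
        graph[parent].add(child)
    return graph


def related_categories(graph, start, max_hops=2):
    """Return categories within max_hops of start, found by breadth-first search."""
    if start not in graph:
        return set()
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return seen - {start}


graph = build_graph(CATEGORY_EDGES)
print(sorted(related_categories(graph, "Diapers")))  # ['Baby', 'Baby Wipes']
```

With a graph like this, a search that matches "Diapers" could also surface offers from sibling categories that share the same parent.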
- The search is run separately for categories, brands, and retailers, and the results are then combined by score, so the user can do a single search without specifying which type of search they want.
- One assumption I made is that the receipts column corresponds to the number of times an offer within that category for the brand has been used.
- For the cosine similarity score I used the sentence transformer "multi-qa-mpnet-base-cos-v1" to embed the offer, brand, retailer, and search strings. I then compared the resulting vectors using cosine similarity.
- The square roots and the division by 2 or 3 are there so that all scores fall in the range -1 to 1.
- Each search scores every possible offer. While we would want to show only a few of the top-scoring offers at first, scoring every offer makes it possible for the user to keep scrolling until they find one they like.
- Each score combines multiple metrics. For example, the retailer score combines all three metrics and weighs them equally. While this seems to be effective, I think that with more data and a true score metric it would be a good idea to develop a model that treats each metric as a separate feature and finds a more thoughtful way to combine them into a final score.
- Brands are treated equally regardless of their total number of receipts. The benefit is that the offers presented are based on relevance, so we should always be getting the most relevant offer possible. However, this can be bad since people are less likely to shop at small brands; therefore, having the list flooded with small brands rather than large brands makes the offers less relevant to the person. I chose to assume people would be just as interested in offers from small brands as from large brands, but in reality I think it would make sense to assume people prefer offers from large brands since more people already shop there.
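
The scoring notes above can be sketched in a few lines of plain Python. This is a simplified illustration, not the notebook's code: the vectors stand in for sentence-transformer embeddings, the function names are made up, the equal-weight mean replaces the exact square-root normalization (which is not reproduced here), and keeping each offer's best score when merging is an assumption about what "combined by score" means.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; always in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def combine_equal(metrics):
    """Equal-weight mean; if every metric is in [-1, 1], so is the result."""
    return sum(metrics) / len(metrics)


def merge_searches(*results):
    """Merge per-type {offer: score} dicts, keeping each offer's best score
    (an assumption), and return (offer, score) pairs sorted high to low."""
    best = {}
    for scores in results:
        for offer, score in scores.items():
            if offer not in best or score > best[offer]:
                best[offer] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy scores standing in for the category, brand, and retailer searches.
category_hits = {"Offer A": 0.8, "Offer B": 0.2}
brand_hits = {"Offer B": 0.6}
retailer_hits = {"Offer C": 0.4}
ranked = merge_searches(category_hits, brand_hits, retailer_hits)
print(ranked)
```

Because every offer receives a score, the full ranked list can be shown a few entries at a time while still letting the user scroll further down.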