
Search and Recommendation system

Ad hoc search is one of the most common scenarios for satisfying a user's need. We wanted to build a search and recommendation system for user queries related to local businesses. The main goal was to fulfil the need of the user and satisfy the user's query with corresponding matching results. We have built a generic search infrastructure that can be extended to any domain, but for the purpose of demonstration we have scoped this project to the “Restaurants” segment. The review data associated with a business is a rich information source that can be mined to infer meaning, business attributes, and sentiment. We implement sentiment analysis on the reviews of each business and extract the top positive and top not-so-positive reviews. For example, if a user searches for a pet-friendly restaurant with good Indian food near Bothell, we should be able to suggest existing pet-friendly Indian restaurants in Bothell.

We analyzed the data for local businesses and reviews from: https://www.kaggle.com/yelp-dataset/yelp-dataset

The goal was to pick one location with a medium-sized business review dataset. The data for the state of Pennsylvania was a perfect fit for this project: around 10,000 local businesses with around 260,000 reviews.
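
As an illustration of the extraction step, here is a minimal sketch of filtering Pennsylvania businesses out of the Yelp dataset's line-delimited JSON. It assumes the standard business file name from the Kaggle download; the project's actual logic lives in extraction.py:

import json

# Each line of the Yelp business file is a standalone JSON object.
pa_businesses = []
with open('yelp_academic_dataset_business.json', encoding='utf-8') as src:
    for line in src:
        business = json.loads(line)
        if business.get('state') == 'PA':
            pa_businesses.append(business)

print(len(pa_businesses))  # around 10,000 for Pennsylvania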

Project demo

The project works end to end, and a UX interface is hosted at https://sars.azurewebsites.net/index.html

The project is primarily composed of two parts.

  • Topic modelling and Sentiment analysis.
  • An API providing search capabilities and UX interface for demo.

Let's understand these components in detail.

Topic modelling and Sentiment analysis

This section covers the Topic modelling and Sentiment analysis pipeline of the project.

Introduction

This module is responsible for carrying out data extraction, topic modelling and sentiment analysis.

Project files

Here are the details of all code files and the project structure

File/Folder Description
extraction.py This file extracts the core attributes from the Yelp dataset.
generateTopicModel.py This file generates the topic model from the tokenized review data produced by the data extraction pipeline of the project.
sentimentAnalysis.py This file generates the sentiment score for a review using VADER sentiment analysis.
yelp_pennsylvania_businesswithreview_dataset.txt This file contains the Pennsylvania data from the Yelp dataset.
yelpReviewSentimentsScore.txt This file contains the sentiment score generated by sentimentAnalysis.py.

Workflow

There are two separate workflows, one for topic modelling and one for sentiment analysis. The topic modelling workflow calls generateTopicModel.py to generate the topics for a business; the sentiment analysis workflow calls sentimentAnalysis.py to generate the sentiment scores.

To execute, run the following commands on the individual files:

python generateTopicModel.py
python sentimentAnalysis.py  
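
The README names VADER for the sentiment step but does not state which library generateTopicModel.py uses for topic modelling. As a purely illustrative sketch (not the project's actual code), topics could be derived from tokenized review data with gensim's LDA:

from gensim import corpora, models

# Illustrative tokenized reviews; the real pipeline reads the tokenized
# data produced by the extraction step.
tokenized_reviews = [
    ['great', 'pizza', 'friendly', 'staff'],
    ['slow', 'service', 'cold', 'food'],
]

dictionary = corpora.Dictionary(tokenized_reviews)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_reviews]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

for topic_id, words in lda.print_topics():
    print(topic_id, words)

For the sentiment step, here is a minimal sketch of computing a VADER compound score with the vaderSentiment package (the review text is illustrative):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores('The pizza was great and the staff were very friendly.')

# 'compound' is a normalized score in [-1, 1]; higher means more positive.
print(scores['compound'])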

API

This section covers the API component of the project. The API is responsible for the indexing, ranking, and search functionality of the project.

Introduction

A search engine web API for indexing and searching review topics and providing location-based results to callers. This API is used by the Search and Recommender System - UX to display search results and recommendations based on user input and location.

The API is written in Python and runs on the Flask web framework.

Project files

Here are the details of all code files and the project structure

File/Folder Description
businessreviews This folder contains the review data.
./businessreviews/businessreviews.dat This file has the tokenized business review data obtained from the data extraction pipeline of the project. The data is formatted as a line-corpus, and each line is represented as a JSON string.
./businessreviews/line.toml line.toml config file for MeTA.
data This folder contains the reverse lookup data.
./data/yelp_pennsylvania_business_recommendation_dataset.json This file contains the reverse lookup data for each business. This data is generated by the extraction pipeline of the project and is in JSON format. This information is used in responses from the API.
application.py Python module defining the Flask application and routes. It exposes one GET API named Search.
application.test.py Python module defining test functions. Super useful for debugging and running outside of the Flask environment. This module is also used to generate the inverted_index and posting files offline/upfront, without the API being invoked.
apptrace.py Python module for collecting and printing trace logs.
config.toml MeTA configuration file. It defines the data file, analyzer settings, etc.
controller.py Python module for extracting request parameters, calling indexer, calling lookup and preparing API response.
indexer.py Python module for indexing, ranking and searching using metapy.
lookup.py Python module for loading and performing lookup by business Ids.
requirements.txt Lists all required Python modules to be installed by pip.
settings.py Python module defining application configuration settings.
stopwords.txt Stopwords list for MeTA indexing.

Configuration

Configuration for the API is in the file settings.py

Here is the description of all configuration settings

Configuration Data type Description
debugMode bool When set to True, the application collects logs. These logs can be retrieved in the response by passing enabletrace in the request query.
lookupFilePath string Relative path to lookup file.
maxSearchResults int Number of top items to fetch while searching.
cfg string Path to the MeTA config file. It's a config.toml file.
datasetKey string Key for the review documents and lookup data.
bm25_K1 float OkapiBM25 K1 parameter
bm25_b float OkapiBM25 b parameter
bm25_k3 float OkapiBM25 k3 parameter
useJsonExtraction bool When set to True, the application performs JSON parsing on each document retrieved by a search. Documents must be formatted as JSON strings for this to work.
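
Since settings.py is a plain Python module, the configuration amounts to a set of assignments. Here is a minimal sketch consistent with the table above; the values shown are illustrative, not the repository's actual settings:

# settings.py (illustrative values only)
debugMode = False                 # collect trace logs when True
lookupFilePath = './data/yelp_pennsylvania_business_recommendation_dataset.json'
maxSearchResults = 10             # number of top items to fetch while searching
cfg = 'config.toml'               # MeTA configuration file
datasetKey = 'businessreviews'    # key for review documents and lookup data
bm25_K1 = 1.2                     # OkapiBM25 K1 parameter
bm25_b = 0.75                     # OkapiBM25 b parameter
bm25_k3 = 500                     # OkapiBM25 k3 parameter
useJsonExtraction = True          # parse retrieved documents as JSON strings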

API workflow

Once the request is received by the Search method in application.py, control is passed to the Search method of the Controller class.

Controller.Search then extracts the parameters from the request and loads the lookup data and indexes.

It then invokes the queryResults method of the Indexer class. queryResults instantiates an OkapiBM25 ranker and then scores and ranks the documents against the query. The method then iterates over the ranked document Ids and builds a list of business Ids from the ranked list, which is returned to the caller.
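
For reference, here is a minimal sketch of the metapy calls this ranking step relies on. The query text and BM25 parameter values are illustrative (the real values come from settings.py and the request); this is not a copy of indexer.py:

import metapy

# Build (or load) the inverted index described by the MeTA config file.
idx = metapy.index.make_inverted_index('config.toml')

# Instantiate an OkapiBM25 ranker with illustrative parameters.
ranker = metapy.index.OkapiBM25(k1=1.2, b=0.75, k3=500)

# Score and rank indexed documents against the query.
query = metapy.index.Document()
query.content('good pizza')
for doc_id, score in ranker.score(idx, query, num_results=10):
    print(doc_id, score)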

Controller.Search then invokes the documentLookup method of the Lookup class. This method returns the business information for the business Ids passed in.

Controller.Search then filters on the location data and splits the query results into searchResults and recommendations.

The searchResults and recommendations are then sent back as the JSON response from the API.
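
The README does not show the exact filtering logic; as a hypothetical sketch, the split could look like this, with businesses matching the requested city/state going into searchResults and the rest becoming recommendations:

def split_results(businesses, city=None, state=None):
    """Hypothetical sketch: split looked-up businesses by location match."""
    search_results, recommendations = [], []
    for business in businesses:
        city_match = not city or business['city'].lower() == city.lower()
        state_match = not state or business['state'].lower() == state.lower()
        if city_match and state_match:
            search_results.append(business)   # matches the requested location
        else:
            recommendations.append(business)  # surfaced as recommendations
    return search_results, recommendations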

API request

An API HTTP request looks like:

GET <server_endpoint_address>/v1/search?text=good+pizza&city=Pittsburgh&state=PA
Parameter Description Required
text Search Query mandatory
city City for query optional
state Two letter State for query optional
enabletrace If passed, the response also contains application trace logs when the server application's debugMode is set to True optional

API response

API response fields

{
    "searchResults" : [
        {
          "address": "422 Greenfield Ave",
          "averageUserRating": "4.526315789473684",
          "business_id": "2d9yZ11uVa83OEQWxe4vlQ",
          "categories": "Pizza, Restaurants",
          "city": "Pittsburgh",
          "name": "Conicella Pizza",
          "reviewCount": "38",
          "sentiment": "0.8938315789473686",
          "state": "PA"
        }
    ],
    "recommendations": [
        {
          "address": "3939 William Penn Hwy",
          "averageUserRating": "4.5576923076923075",
          "business_id": "X4QbkHl7pOTVsgLa7XuUBg",
          "categories": "Gluten-Free, Restaurants, Pizza, Salad, Fast Food",
          "city": "Monroeville",
          "name": "Blaze Fast Fire'd Pizza",
          "reviewCount": "104",
          "sentiment": "0.8639048076923076",
          "state": "PA"
        }
    ]
}

We have hosted this application at the following endpoint: http://13.77.179.28.

Feel free to call this API, for example:

GET http://13.77.179.28/v1/search?text=good%20pizza&city=Pittsburgh&state=PA
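
Or equivalently from Python, using the requests library (the query values mirror the example above):

import requests

resp = requests.get(
    'http://13.77.179.28/v1/search',
    params={'text': 'good pizza', 'city': 'Pittsburgh', 'state': 'PA'},
)
print(resp.json()['searchResults'])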

To host and run the application yourself, please follow the rest of this file.

System Requirements

We used Ubuntu Server 18.04 LTS as our server; however, the API should work on any stable Linux distribution. Please ensure that you have the following installed on your Linux machine.

  • Python3

    • It comes preinstalled on most Linux distributions
  • Nginx

    • A very popular HTTP server, reverse proxy server, and generic TCP/UDP proxy server

Modules

All required Python modules are listed in the file

./sarsapi/requirements.txt

The modules can be classified as follows

Area Modules Description
Python virtual env wheel, venv Used for creating the virtual environment.
Python MeTA bindings metapy, pytoml Modern Text Analysis (MeTA) libraries.
Flask and related Flask, itsdangerous, Jinja2, MarkupSafe, Werkzeug, flask-jsonpify, flask-restful, flask-cors Flask web framework modules.
Web Server gunicorn WSGI HTTP server for Python apps. The metapy libraries are not compatible with the Flask development server, hence gunicorn is needed.

Installation

Clone this repository to a folder, then open a shell and change directory to the sarsapi folder.

Let's ensure pip is installed. At the shell prompt, execute:

sudo apt-get install python3-pip

Make sure pip and setuptools are at the latest version:

pip3 install --upgrade pip setuptools

Install python3-dev and python3-virtualenv

sudo apt-get install python3-dev python3-virtualenv

We installed virtualenv and pip to handle our application dependencies. Let's create the virtual environment.

We will create the environment in the current folder, represented by . (dot), and activate it.

python3 -m venv .
source ./bin/activate

Our prompt will change when the virtualenv is activated and will look like this:

(sarsapi) premp3@linux:~$ 

Now, let's ensure we have the latest pip in the virtual environment:

pip install --upgrade pip

Now we will install all the modules listed in requirements.txt:

pip install -r requirements.txt

When this command finishes, you will have all the required modules installed, and we are ready to start the API.

Before starting the API, let's create the index files which the API will use to perform searches. This step saves time when the API starts, as the required indexes and postings are created upfront.

The file application.test.py has the code to initiate creation of the inverted index and perform a test search. Let's run this file in Python:

python ./application.test.py

When this execution is complete, we are ready to launch our API.

We could run our app with the Flask development server using the python application.py command. However, the metapy libraries are incompatible with it, so we will run the Flask app with Gunicorn.

To start the API, execute

gunicorn -w4 application:app

Our API is now running at http://127.0.0.1:8000

Now, let's expose this API endpoint on port 80 using the Nginx web server.

Install Nginx

Execute the following to get Nginx installed:

sudo apt update
sudo apt install nginx

After the installation is complete, enable the firewall rule for Nginx:

sudo ufw allow 'Nginx HTTP'

We are only allowing HTTP endpoints; if needed, we can allow both HTTP and HTTPS using 'Nginx Full'.

Now, let's configure Nginx to proxy to our Python gunicorn server running on port 8000.

Configure Nginx

Execute the following to edit the Nginx configuration:

sudo nano /etc/nginx/sites-available/default

Then edit the content as below.

server {
    listen 80;
    server_name _;
    
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

Save the file and then restart Nginx.

sudo systemctl restart nginx

That's it.

Our API is now available on port 80, i.e. at http://127.0.0.1 or the machine's external IP or address.

UX

This section covers the UX component of the project.

Introduction

The UX is responsible for getting the user's query and location and invoking the API to get search results and recommendations. This is the interface through which users interact with the system.

The UX is a single-page web application implemented in React.

Project files

Here are the details of all code files and the project structure.

The project is scaffolded using the create-react-app command on Node. The code files are in the src folder.

File/Folder Description
App.css CSS style file for the UX.
App.js Application main file. Displays the input form, handles all user interaction, and displays results.
httpclient.js JS class providing an HTTP client for making XMLHttpRequest calls.
index.css React template css file
index.js React template index file
locationprovider.js JS class for providing the user's current location using the Bing Maps API.
logo.svg logo for UX
searchresults.js View for rendering individual search results.
settings.js Configuration file for UX.
searchbutton.png Image for search.
star.png Image for rating star.
All other files and folders All other files and folders are generated by the React template or are used for deployment to the Azure cloud.

Configuration

Configuration for the UX is in the file settings.js

Here is the description of all configuration settings

Configuration Data type Description
bingmapsApiUrl string Bing Maps API URL for getting location data.
bingmapsApiKey string Bing Maps API key.
searchApiUrl string URL to the sars API, which we implemented as part of this project.
searchUrlTemplate string Template for the query parameters of the API call.

UX workflow

The UX is essentially a form for capturing the user's query term as well as location information.

At start, the UX tries to detect the user's location using the Bing Maps Locations REST API. Once the UX is loaded, the user can enter the desired query and provide a location, then hit Enter or press the search button.

This triggers a call to the sars API, and when the response from the API is received, the results are displayed in two columns: one for search results and the other for recommendations from the system.

Installation

Clone this repository to a folder and ensure the latest stable versions of Node and npm are installed on the system.

Now let's ensure all required node modules are installed:

npm install

Once all packages are installed, simply execute the following to get the local UI running:

npm start

This kicks off the UX at port 3000.

That's it.

We have hosted our UI on the Azure cloud at https://sars.azurewebsites.net/index.html.

Contributors