Ad hoc search is one of the most common scenarios for satisfying user needs. We built a search and recommendation system for user queries related to local businesses. The main goal is to fulfil the user's need and answer the user query with corresponding results. The review data associated with a business is a rich information source that can be mined to infer meaning, business attributes, and sentiment. We implement sentiment analysis on the reviews of a business and extract the top positive and top not-so-positive reviews. We scope our search and recommendation system to the “Restaurants” segment. For example, if a user searches for a pet-friendly restaurant with good Indian food near Bothell, we should be able to suggest existing pet-friendly Indian restaurants in Bothell.
We analyzed the data for local businesses and reviews from https://www.kaggle.com/yelp-dataset/yelp-dataset
The goal was to pick one location with a medium-sized business review dataset. The data for the state of Pennsylvania was a perfect fit for this project: it has around 10,000 local businesses with around 260,000 reviews.
The project works end to end, and a UX interface is hosted at https://sars.azurewebsites.net/index.html
The project is primarily composed of two parts.
- Topic modelling and Sentiment analysis.
- An API providing search capabilities and UX interface for demo.
Let's understand these components in detail.
This section covers the Topic modelling and Sentiment analysis pipeline of the project.
This module is responsible for carrying out data extraction, topic modelling and sentiment analysis.
Here are the details of all code files and the project structure:
File/Folder | Description |
---|---|
extraction.py | This file extracts the core attributes from yelp dataset. |
generateTopicModel.py | This file generates the topic model from the tokenized review data obtained from the data extraction pipeline of the project. |
sentimentAnalysis.py | This file generates the sentiment score for a review using VADER sentiment analysis. |
yelp_pennsylvania_businesswithreview_dataset.txt | This file contains the Pennsylvania data from the Yelp dataset. |
yelpReviewSentimentsScore.txt | This file contains the sentiment score generated by sentimentAnalysis.py. |
There are two separate workflows, one for topic modelling and one for sentiment analysis. The topic modelling workflow calls generateTopicModel.py to generate the topics for a business (a sketch follows below); the sentiment analysis workflow calls sentimentAnalysis.py to generate the sentiment scores.
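To give a feel for the topic modelling step, here is a minimal sketch using gensim's LdaModel. This is an illustration only, under the assumption of an LDA-style model; generateTopicModel.py may implement the step differently.

```python
# Illustrative topic-modelling sketch (assumption: gensim LDA);
# the project's generateTopicModel.py may differ in approach and detail.
from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized reviews for one business, standing in for the
# tokenized data produced by the extraction pipeline.
tokenized_reviews = [
    ["great", "pizza", "friendly", "staff"],
    ["pizza", "crust", "amazing", "quick", "service"],
    ["slow", "service", "cold", "pizza"],
]

# Build the vocabulary and bag-of-words corpus
dictionary = corpora.Dictionary(tokenized_reviews)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_reviews]

# Fit a small LDA model and print the discovered topics
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```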
To execute, run the following commands on the individual files:
python generateTopicModel.py
python sentimentAnalysis.py
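Under the hood, VADER scoring boils down to a polarity_scores call. A minimal sketch, assuming the vaderSentiment package (sentimentAnalysis.py may wire this up differently):

```python
# Minimal VADER scoring sketch; the review text here is made up for illustration.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
review = "The butter chicken was fantastic and the staff were super friendly!"
scores = analyzer.polarity_scores(review)

# 'compound' is a normalized score in [-1, 1]; values near 1 are strongly positive
print(scores["compound"])
```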
This section covers the API component of the project. The API is responsible for the indexing, ranking and search functionality of the project.
It is a search engine web API for indexing and searching review topics and providing location-based results to callers. This API is used by the Search and Recommender System UX to display search results and recommendations based on user inputs and location.
The API is written in Python and runs on the Flask web framework.
Here are the details of all code files and the project structure:
File/Folder | Description |
---|---|
businessreviews | This folder contains the review data. |
./businessreviews/businessreviews.dat | This file has the tokenized business review data obtained from the data extraction pipeline of the project. The data is formatted as a line-corpus, and each line is represented as a JSON string (see the loading sketch after this table). |
./businessreviews/line.toml | line.toml config file for MeTA. |
data | This folder contains the reverse lookup data. |
./data/yelp_pennsylvania_business_recommendation_dataset.json | This file contains the reverse lookup data for each business. This data is generated by the extraction pipeline of the project and is in JSON format. This information is used in responses from the API. |
application.py | Python module defining the Flask application and routes. It exposes one GET API named Search. |
application.test.py | Python module defining test functions. Super useful for debugging and running outside of the Flask environment. This module is also used to generate the inverted_index and posting files offline/upfront, without the API being invoked. |
apptrace.py | Python module for collecting and printing trace logs. |
config.toml | MeTA configuration file. It defines the data file, analyzer settings, etc. |
controller.py | Python module for extracting request parameters, calling indexer, calling lookup and preparing API response. |
indexer.py | Python module for indexing, ranking and searching using metapy. |
lookup.py | Python module for loading and performing lookup by business Ids. |
requirements.txt | Lists all required Python modules to be installed by pip. |
settings.py | Python module defining application configuration settings. |
stopwords.txt | Lists stopwords for MeTA indexing. |
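Since businessreviews.dat is a line-corpus with one JSON string per line, loading it is straightforward. A minimal sketch (the fields inside each document are not assumed here):

```python
# Minimal sketch: load the line-corpus file, one JSON document per line.
import json

with open("businessreviews/businessreviews.dat", encoding="utf-8") as f:
    documents = [json.loads(line) for line in f if line.strip()]

print(len(documents), "review documents loaded")
```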
Configuration for the API is in the file settings.py.
Here is the description of all configuration settings:
Configuration | Data type | Description |
---|---|---|
debugMode | bool | When set to True, the application collects logs. These logs can be retrieved in the response by passing enabletrace in the request query. |
lookupFilePath | string | Relative path to lookup file. |
maxSearchResults | int | Number of top items to fetch while searching. |
cfg | string | Path to the MeTA config file, config.toml. |
datasetKey | string | key for review documents and lookup data. |
bm25_K1 | float | OkapiBM25 K1 parameter |
bm25_b | float | OkapiBM25 b parameter |
bm25_k3 | float | OkapiBM25 k3 parameter |
useJsonExtraction | bool | When set to True, the application performs JSON parsing on each document retrieved after search. Documents must be formatted as JSON strings for this to work. |
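Putting the table together, a settings.py matching these entries might look like the following sketch. The names come from the table above, but every value shown is an illustrative placeholder, not the deployed configuration:

```python
# Illustrative settings.py sketch; names follow the configuration table,
# values are placeholders and NOT the actual deployment values.
debugMode = False               # collect trace logs when True
lookupFilePath = "./data/yelp_pennsylvania_business_recommendation_dataset.json"
maxSearchResults = 25           # number of top items to fetch while searching
cfg = "config.toml"             # path to the MeTA config file
datasetKey = "businessreviews"  # key for review documents and lookup data
bm25_K1 = 1.2                   # OkapiBM25 K1 parameter
bm25_b = 0.75                   # OkapiBM25 b parameter
bm25_k3 = 500.0                 # OkapiBM25 k3 parameter
useJsonExtraction = True        # parse retrieved documents as JSON strings
```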
Once the request is received by the `Search` method in application.py, control is passed to the `Search` method of the `Controller` class. `Controller.Search` then extracts the parameters from the request and loads the lookup data and indexes. It then invokes the `queryResults` method of the `Indexer` class. The `queryResults` method instantiates an `OkapiBM25` ranker, then scores and ranks the documents against the query. The method then iterates over the ranked document IDs and builds a list of business IDs from the ranked list, which is returned to the caller. `Controller.Search` then invokes the `documentLookup` method of the `Lookup` class, which returns the business information for the business IDs passed in. `Controller.Search` then filters on the location data and splits the query results into `searchResults` and `recommendations`. The `searchResults` and `recommendations` are then sent back as the JSON response from the API.
An API HTTP request looks like:
GET <server_endpoint_address>/v1/search?text=good+pizza&city=Pittsburgh&state=PA
Parameter | Description | Required |
---|---|---|
text | Search Query | mandatory |
city | City for query | optional |
state | Two letter State for query | optional |
enabletrace | If passed, the response also contains application trace logs when the server application's debugMode is set to True | optional |
A sample API response showing the response fields:
{
"searchResults" : [
{
"address": "422 Greenfield Ave",
"averageUserRating": "4.526315789473684",
"business_id": "2d9yZ11uVa83OEQWxe4vlQ",
"categories": "Pizza, Restaurants",
"city": "Pittsburgh",
"name": "Conicella Pizza",
"reviewCount": "38",
"sentiment": "0.8938315789473686",
"state": "PA"
}
],
"recommendations": [
{
"address": "3939 William Penn Hwy",
"averageUserRating": "4.5576923076923075",
"business_id": "X4QbkHl7pOTVsgLa7XuUBg",
"categories": "Gluten-Free, Restaurants, Pizza, Salad, Fast Food",
"city": "Monroeville",
"name": "Blaze Fast Fire'd Pizza",
"reviewCount": "104",
"sentiment": "0.8639048076923076",
"state": "PA"
}
]
}
We have hosted this application at the endpoint http://13.77.179.28.
Feel free to call this API, for example:
GET http://13.77.179.28/v1/search?text=good%20pizza&city=Pittsburgh&state=PA
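The same call from Python, using the requests library and the response fields shown above:

```python
# Query the hosted API and print the business names from both result lists.
import requests

response = requests.get(
    "http://13.77.179.28/v1/search",
    params={"text": "good pizza", "city": "Pittsburgh", "state": "PA"},
)
data = response.json()

print([b["name"] for b in data["searchResults"]])
print([b["name"] for b in data["recommendations"]])
```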
To host and run the application yourself, please follow the rest of this section.
We used Ubuntu Server 18.04 LTS as our server. However, the API should work on any stable distribution of Linux. Please ensure that you have the following installed on your Linux machine:
- Python3 - This comes by default in Linux.
- Nginx - A very popular HTTP server, reverse proxy server, and generic TCP/UDP proxy server.
All required Python modules are listed in the file ./sarsapi/requirements.txt. The modules can be classified as follows:
Area | Modules | Description |
---|---|---|
Python virtual env | wheel, venv | These are used for creating virtual environment. |
Python MeTA bindings | metapy, pytoml | Modern Text Analysis libraries |
Flask and related | Flask, itsdangerous, Jinja2, MarkupSafe, Werkzeug, flask-jsonpify, flask-restful, flask-cors | Flask web framework modules |
Web Server | gunicorn | WSGI HTTP server for Python apps. The metapy libraries are not compatible with the Flask development server, hence gunicorn is needed. |
Clone this repository to a folder, then open a shell and change directory to the sarsapi folder.
Let's ensure pip is installed. At the shell prompt, execute:
sudo apt-get install python3-pip
Make sure pip and setuptools are at the latest version:
pip3 install --upgrade pip setuptools
Install python3-dev and python3-virtualenv
sudo apt-get install python3-dev python3-virtualenv
We installed virtualenv and pip to handle our application dependencies. Let's create the virtual environment.
We will create the environment in the current folder, represented by . (dot), and activate it:
python3 -m venv .
source ./bin/activate
Our prompt will change when the virtualenv is activated and will look like this:
(sarsapi) premp3@linux:~$
Now, let's ensure we have the latest pip in the virtual environment:
pip install --upgrade pip
Now we will install all modules listed in the file requirements.txt:
pip install -r requirements.txt
When this command finishes, all the required modules are installed and we are ready to start the API.
Before starting the API, let's create the index files which the API will use to perform search. This step saves time when the API starts, as the required indexes and postings are created upfront.
The file application.test.py has the code to initiate creation of the inverted index and perform a test search. Let's run this file in Python:
python ./application.test.py
When this execution is complete, we are ready to launch our API.
We could run our app with the Flask development server using the `python application.py` command. However, the metapy libraries are incompatible with it, so we will run the Flask app with Gunicorn.
To start the API, execute the following (the -w4 flag starts four worker processes):
gunicorn -w4 application:app
Our API is now running at the address http://127.0.0.1:8000
Now, let's expose this API endpoint on port 80 using the Nginx web server.
Execute the following to get Nginx installed:
sudo apt update
sudo apt install nginx
After installation is complete, enable the firewall rule for Nginx:
sudo ufw allow 'Nginx HTTP'
We are only allowing HTTP endpoints; if needed, we can also allow both HTTP and HTTPS using the `Nginx Full` profile.
Now, let's configure Nginx to proxy to our Python gunicorn server running on port 8000.
Execute the following to edit the Nginx configuration:
sudo nano /etc/nginx/sites-available/default
Then edit the content as below.
server {
listen 80;
server_name _;
location / {
proxy_pass http://127.0.0.1:8000;
proxy_set_header Host $host;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
}
Save the file and then restart nginx.
sudo systemctl restart nginx
Our API is now available on port 80, i.e. at http://127.0.0.1 or whatever the external IP or address is.
This section covers the UX component of the project.
The UX is responsible for getting the user's query and location and invoking the API to get search results and recommendations. This is the interface through which users interact with the system.
The UX is a single-page web application implemented in React.
Here are the details of all code files and the project structure.
The project is scaffolded using the `create-react-app` command on Node. The code files are in the src folder.
File/Folder | Description |
---|---|
App.css | CSS style file for the UX. |
App.js | Application main file. Displays the input form, handles all user interaction and displays results. |
httpclient.js | JS class for an HTTP client making XMLHttpRequests. |
index.css | React template css file |
index.js | React template index file |
locationprovider.js | JS class providing the user's current location using the Bing Maps API. |
logo.svg | Logo for the UX. |
searchresults.js | View for rendering individual search results. |
settings.js | Configuration file for UX. |
searchbutton.png | Image for search. |
star.png | Image for rating star. |
All other files and folders | All other files and folders are generated by the React template or are used for deployment to the Azure cloud. |
Configuration for the UX is in the file settings.js.
Here is the description of all configuration settings:
Configuration | Data type | Description |
---|---|---|
bingmapsApiUrl | string | Bing Maps API URL for getting location data. |
bingmapsApiKey | string | Bing Maps API key. |
searchApiUrl | string | URL of the sars API which we implemented as part of this project. |
searchUrlTemplate | string | Template for query parameters for API call. |
The UX is basically a form capturing the user's input for the query term as well as location information.
At start, the UX tries to detect the user's location using the Bing Maps Locations REST API. Once the UX is loaded, the user can enter the desired query and provide a location, then hit Enter or press the search button.
This triggers a call to the sars API; when the response from the API is received, the results are displayed in two columns, one for search results and the other for recommendations from the system.
Clone this repository to a folder and ensure the latest stable versions of Node and npm are installed on the system.
Now let's ensure all required node_modules are installed:
npm install
Once all packages are installed, simply execute the following to get the local UI running:
npm start
This kicks off the UX at port 3000.
We have hosted our UI on Azure cloud at https://sars.azurewebsites.net/index.html .
- Varun Kakkar - vkakkar2@illinois.edu
- Prem Prakash - premp3@illinois.edu