ASHEN: Area SearcH ENgine

Redisearch based full text fuzzy search engine

Description

This is an implementation of a dead simple in-memory search engine built with redis db and redisearch module. A fuzzy area-description dataset has been used here for demonstrating the process if indexing and querying data. However, it can be used as a quick template to build any sort of search engine where the entire indexed data primary lives in the memory and the query response needs to be performant. While performing queries, this implementation applies Levenstein distance based full text fuzzy matching. Also, it automatically backs up the entire index periodically in the ./redisearch-data folder and can be configured through the docker-compose.yml file. The entire stack consists of:

Running the Engine

Before running the engine, install docker and docker-compose on your machine.
Clone the repo and go to the root folder.
In the ./settings.toml file provide your internal ip as host = <your-internal-ip> under the production section.
Run
```
docker-compose up -d
```

Making Index

To make the engine functional, you will need to provide data in a specific format that will eventually be indexed by the engine. In this case, the area-description dataset looks like this. You'll find a sample dataset in the index-data folder. Your dataset should be named as area.csv:

index, areaId, areaTile, areaBody
0    , 1     , Azimpur , Example area in Azimpur
1    , 2     , Lalbagh , Some are in lalbagh
2    , 3     , Feni    , Sadar road, Feni

In the root folder, create a python 3.8 virtual environment, activate the environment and install the dependencies via running the following commands one by one:
```
python3.8 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Place your data (should be formatted like above) in the index-data folder and run:
```
python -m index.insert_data
```
This should start the indexing process. It takes around a minute to insert one million key value pairs in redis.
You can explore your dataset by going this url. This opens up a RedisInsight dashboard:
```
<yourhost>:8001
```

Running Queries

Queries can be performed on the following POST API:
```
<yourhost>/area-search/
```

Header:

Content-Type: application/json
x-api-key: 1234ABCD

The payload should go as JSON:

{"query": "West Shaorapara,around Mirpur 10,\nShapla sharani.\nHouse no:438/3"}

Response:

{
"matchedArea": [
    {
    "areaBody": "House5,road1,block E,cholontica more,mirpur6,dhaka1216",
    "areaId": "315",
    "areaTitle": "Mirpur",
    "score": 48.0
    },
    {
    "areaBody": "House3 Road9 Block c Mirpur6",
    "areaId": "315",
    "areaTitle": "Mirpur",
    "score": 48.0
    },
    {
    "areaBody": "House3 Road9 Block c Mirpur6",
    "areaId": "315",
    "areaTitle": "Mirpur",
    "score": 48.0
    }
],
"query": "West Shaorapara,around Mirpur 10,\nShapla sharani.\nHouse no:438/3",
"verdictArea": "Mirpur",
"verdictAreaId": "315"
}

Architecture

.
├── app                       [flask-application]
│   ├── __init__.py
│   └── search_api
│       ├── __init__.py
│       ├── search_data.py
│       ├── utils.py
│       └── views.py
├── docker-compose.yml
├── Dockerfile
├── flask_run.py
├── index                     [This module should be run to insert new data]
│   ├── __init__.py
│   ├── index_data.py
│   └── insert_data.py
├── index-data                [Index module pulls data from here]
│   ├── area.csv
│   └── placeholder-area.csv
├── LICENSE
├── README.md
├── redisearch-data           [Redis back lives here]
│   ├── dump.rdb
│   └── placeholder.rdb
├── requirements.txt
└── settings.toml

Remarks

This application is built and tested on:

Python 3.8
Ubuntu 18.05
Redis stable 5.0
Redisearch 1.6.10
Flask 1.1.x
Pandas 1x